RESEARCH OVERVIEW


This is the first semiannual report on this contract. Most of the work done under the preceding contract,
entitled “A Coherent VLSI Design Environment,” has been reported in prior years; however, some of the
papers listed here and a little of the work pertain to that contract, which formally expired December 31, 1987.

The  purpose  of  the  present  contract  is  to  investigate  limiting  technologies  for  a  very  large  computer 
system,  one  which,  if  built  during  the  mid  1990’s,  would  be  so  large  that  the  nation  could  only  afford  one  or  two 
of  them.  Extrapolation  of  hardware  technology  trends  suggests  that  the  following  rough  parameters  would  be 
achievable: 


Computational Power              4 x 10^15 FLOPS
Power Dissipation                50+ MW
Memory                           3 x 10^15 bits
Bisection Capacity               4 x 10^18 bits/sec
Number of Physical Processors    3 x 10^8
Number of Virtual Processors     10^12
Size                             14-story building
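
As a rough check on what these parameters imply per processor, the ratios can be worked out directly. The input values are copied from the table; the derived per-processor figures are not stated in this report and are illustrative only.

```python
# Rough parameters copied from the table above.
flops = 4e15            # computational power, FLOPS
memory_bits = 3e15      # total memory, bits
bisection_bps = 4e18    # bisection capacity, bits/sec
physical = 3e8          # number of physical processors
virtual = 1e12          # number of virtual processors

# Derived figures (illustrative; not stated in the report).
flops_per_processor = flops / physical           # about 1.3e7 FLOPS each
bits_per_processor = memory_bits / physical      # 1e7 bits each
virtual_per_physical = virtual / physical        # about 3.3e3
bisection_bits_per_flop = bisection_bps / flops  # 1e3 bits/sec per FLOPS

print(f"{flops_per_processor:.2e} FLOPS per physical processor")
print(f"{bits_per_processor:.2e} bits of memory per physical processor")
print(f"{virtual_per_physical:.0f} virtual processors per physical processor")
print(f"{bisection_bits_per_flop:.0f} bisection bits/sec per FLOPS")
```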

The  purpose  of  this  contract  is  to  investigate  plausibility,  which  is  defined  as  the  step  before  feasibility.  Once 
plausibility  is  demonstrated,  then  separate  efforts  at  demonstrating  feasibility  and  then  design  and  manufacture 
would be appropriate. It is estimated that the cost of such a “building-size computer” would be in excess of
$1,000,000,000.

Six  critical  areas  were  identified,  and  work  is  in  progress  in  each  of  these  areas.  The  six  areas,  and  the 
MIT  faculty  members  who  are  working  in  each  of  the  areas,  are  listed  in  the  chart  on  the  next  page.  The  format 
of  this  report  is  based  on  this  table. 


The  work  in  circuits  is  a  combination  of  improved  techniques  in  waveform  bounding,  and  new  ideas  in 
LU  factorization  and  relaxation  methods.  It  has  been  found  that  in  the  context  of  highly  parallel  computation, 
the  Gauss-Jacobi  technique  is  never  inferior  to  the  Gauss-Seidel  technique. 


Prototype  circuits  for  a  proposed  processing  element  have  been  fabricated,  and  are  under  evaluation. 


The  investigation  of  communications  topology  and  related  architectural  concerns  centers  around  two 
projects. In one of them, the Message-Driven Processor is the object of study. A prototype network design
frame has been fabricated and tested. The study of cache techniques and directory schemes has just begun,
with the addition of Prof. Anant Agarwal to the faculty.

In  the  systems  software  area,  the  effort  centers  on  an  operating  system  for  the  Message-Driven 
Processor.  Three  major  paradigms  for  resource  management  have  been  found:  object  replication,  proactive 
mechanisms,  and  elimination  of  dynamic  bottlenecks. 


The  work  on  algorithms  involves  highly  parallel  algorithms  for  a  variety  of  purposes,  including  graph 
theory.  Many  new  results  are  given  in  this  section  and  in  the  various  publications. 

The application area so far has centered around parallel simulation of a large analog VLSI chip. The
chip  considered  is  one  that  can  be  part  of  a  smart  sensor.  No  current  simulation  tools  can  handle  circuits  of 
such  complexity  and  scale. 

The  principal  activities  and  contributing  faculty  are  listed  here: 


Activity                         Faculty

Circuits                         Lance A. Glasser
                                 Thomas F. Knight, Jr.
                                 Paul Penfield, Jr.
                                 Jacob K. White
                                 John L. Wyatt, Jr.

Processing Elements              William J. Dally

Communications Topology          Anant Agarwal
and Routing Algorithms           William J. Dally
                                 F. Thomson Leighton
                                 Charles E. Leiserson

Systems Software                 William J. Dally

Algorithmic Models               F. Thomson Leighton
                                 Charles E. Leiserson

Applications                     Charles E. Leiserson
                                 Jacob K. White
                                 John L. Wyatt, Jr.

CIRCUITS


Prof.  John  Wyatt,  D.  L.  Standley,  and  P.  O’Brien  have  been  finishing  up  our  previous  project  on  CAD 
and  beginning  a  new  one  on  parallel  simulation  methods  for  regular  arrays.  The  goal  of  this  last  phase  of  our 
CAD project has been to extend the waveform bounding results that Penfield et al. developed for MOS circuits
so  that  they  work  for  high-speed  ECL  as  well.  Our  recent  work  consists  of  two  major  parts: 

•  macromodelling  of  ECL  logic  gates  acting  both  as  drivers  and  as  loads,  and 

•  delay  estimation  for  individual  nets  using  the  gate  macromodel  parameters  and  RC  tree  models  for  metal 
interconnect. 

The  success  of  the  macromodelling  approach  relies  on  repetitive  use  of  a  library  of  modelled  cells.  A  fixed 
computational  cost  (several  mainframe  CPU  hours  per  cell)  is  paid  to  obtain  parameter  values  for  the 
simplified  macromodels.  The  resultant  timing  estimates  are  typically  within  5%  to  10%  of  SPICE  and  are 
obtained  with  roughly  1000  times  less  CPU  time  per  run.  This  work  has  taken  a  very  practical  turn  and  has 
been  extensively  tested  on  an  industrial  ECL  process  and  cell  library.  It  is  now  in  use  in  American  industry. 
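
To make the RC tree step concrete, the sketch below computes the Elmore delay at each node of an RC tree, a first-order time constant commonly used with RC tree interconnect models. This report does not give its delay formulas, so this is an assumed stand-in rather than the macromodel itself; the function name, the node-numbering convention, and the element values in the usage note are hypothetical.

```python
def elmore_delays(parent, resistance, capacitance):
    """Elmore delay at every node of an RC tree (illustrative sketch).

    parent[i]      -- index of node i's parent, or None for the driver (node 0)
    resistance[i]  -- resistance of the branch from parent[i] to node i
    capacitance[i] -- capacitance from node i to ground
    Nodes must be numbered so that parent[i] < i.
    """
    n = len(parent)
    # Total capacitance at or below each node, accumulated leaves to root.
    downstream = list(capacitance)
    for i in range(n - 1, 0, -1):
        downstream[parent[i]] += downstream[i]
    # Each branch adds (branch resistance) x (capacitance it must charge).
    delay = [0.0] * n
    for i in range(1, n):
        delay[i] = delay[parent[i]] + resistance[i] * downstream[i]
    return delay
```

For a driver at node 0 feeding node 1, which branches to nodes 2 and 3, `elmore_delays([None, 0, 1, 1], [0, 1, 2, 3], [0, 1, 1, 1])` gives delays of 3, 5, and 6 at nodes 1, 2, and 3 in units of ohm-farads.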

Prof.  Jacob  White  is  investigating  the  possibility  of  accelerating  the  transient  simulation  of  MOS  devices 
by  using  waveform  relaxation.  The  WR  algorithm  is  being  applied  to  the  sparsely-connected  system  of  algebraic 
and  ordinary  differential  equations  in  time  generated  by  standard  spatial  discretization  techniques.  It  was 
proved  that  the  WR  contracts  in  a  uniform  norm  on  a  model  of  the  device  simulation  problem,  and  the  result 
was  verified  on  a  one-dimensional  experiment*.  The  implementation  of  the  method  for  2-D  device 
simulation  is  in  progress. 
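
The structure of waveform relaxation can be seen on a toy problem. The sketch below applies Gauss-Jacobi WR with backward Euler to a small linear system x' = Ax; it is not the device-simulation formulation studied here, and the function names, the test matrix, and the step size are assumptions chosen for illustration.

```python
def waveform_relaxation(a, x0, dt, steps, sweeps):
    """Gauss-Jacobi waveform relaxation for x' = A x with backward Euler.

    Each sweep integrates every equation over the whole time window by
    itself, taking the other variables from the previous sweep's waveforms.
    """
    n = len(x0)
    # Initial guess: every waveform is constant at its initial value.
    waves = [[x0[i]] * (steps + 1) for i in range(n)]
    for _ in range(sweeps):
        new_waves = []
        for i in range(n):  # independent of each other: one task per equation
            w = [x0[i]]
            for k in range(1, steps + 1):
                coupling = sum(a[i][j] * waves[j][k] for j in range(n) if j != i)
                # Backward Euler: (w_k - w_{k-1}) / dt = a_ii * w_k + coupling
                w.append((w[-1] + dt * coupling) / (1.0 - dt * a[i][i]))
            new_waves.append(w)
        waves = new_waves
    return waves


def coupled_backward_euler(a, x0, dt, steps):
    """Reference for the 2x2 case: solve both equations together each step."""
    x, y = x0
    m11, m12 = 1.0 - dt * a[0][0], -dt * a[0][1]
    m21, m22 = -dt * a[1][0], 1.0 - dt * a[1][1]
    det = m11 * m22 - m12 * m21
    out = [(x, y)]
    for _ in range(steps):
        x, y = (m22 * x - m12 * y) / det, (m11 * y - m21 * x) / det
        out.append((x, y))
    return out
```

On this example the relaxed waveforms approach the coupled solution as the number of sweeps grows, which is the discrete analogue of the contraction property mentioned above.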

LU factorization of circuit simulation matrices is difficult to parallelize, in part because methods like
parallel nested dissection are ineffective due to the difficulty of finding good separators. We are investigating a
more  direct  approach  to  exploiting  the  parallelism  in  sparse  matrix  solution.  In  particular,  we  are  examining  the 
interaction  between  sparse  matrix  data  structures  and  computer  memory  structure,  to  see  how  to  store  sparse 
matrices  for  effective  parallel  computation.  One  interesting  result  is  that  it  is  possible  to  store  matrices  in  a 
scattered  form  which  allows  for  access  of  the  sparse  entries  by  fast  indexing,  in  only  three  times  the  storage. 
This  data  structure  is  static,  constructed  once  the  matrix  structure  is  known,  and  is  therefore  very  attractive  for 
parallel  implementation. 

The results indicating the optimality of Gauss-Jacobi over Gauss-Seidel relaxation on infinitely many
parallel processors have been extended, mostly to connect the spectral radius of the iteration matrices to their
graphical properties†. In addition, the results are being extended to the waveform relaxation case, where this
limiting result seems to show on as few as eight processors.
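
The difference between the two relaxation schemes, as far as parallelism is concerned, is visible in a single sweep. The sketch below is a minimal illustration and does not reproduce the optimality result; the function names are hypothetical.

```python
def jacobi_sweep(a, b, x):
    """One Gauss-Jacobi sweep: every update reads only the old vector x,
    so all n updates could run at the same time on n processors."""
    n = len(b)
    return [
        (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
        for i in range(n)
    ]


def seidel_sweep(a, b, x):
    """One Gauss-Seidel sweep: update i uses the new values of x[0..i-1],
    which chains the updates together in this straightforward ordering."""
    x = list(x)
    n = len(b)
    for i in range(n):
        x[i] = (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
    return x


def solve(sweep, a, b, sweeps):
    """Apply the given sweep repeatedly, starting from the zero vector."""
    x = [0.0] * len(b)
    for _ in range(sweeps):
        x = sweep(a, b, x)
    return x
```

Gauss-Seidel usually needs fewer sweeps, but each Jacobi sweep takes one parallel step, which is the trade-off the result above addresses.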

We  are  trying  to  improve  the  reliability  of  relaxation  methods  by  extracting  lines,  or  banded  matrices, 
from  a  given  sparse  matrix,  solving  the  bands  directly,  and  relaxing  on  the  rest  of  the  matrix.  This  approach  is 
efficient  because  band  matrices  can  be  solved  in  log  n  time  on  n  processors,  and  is  more  reliable  than 
standard  relaxation,  because  “less”  relaxation  is  being  used.  The  idea  has  been  tested  on  circuit  simulation 
matrices  and  converges  faster  and  more  often  than  standard  relaxation  on  all  examples  tried  so  far.  We  are 


*M.  Reichelt,  J.  White,  J.  Allen  and  F.  Odeh,  “Waveform  Relaxation  Applied  to  Transient  Device  Simulation,” 
1988  International  Symposium  on  Circuits  and  Systems.  Espoo,  Finland,  June  7-9, 1988. 

†D. Smart and J. White, “Reducing the Parallel Solution Time of Sparse Circuit Matrices using Reordered
Gaussian Elimination and Relaxation,” 1988 International Symposium on Circuits and Systems, Espoo,
Finland, June 7-9, 1988.




presently  working  on  how  to  select  the  ordering  of  the  matrix  to  best  exploit  the  direct  solution  of  the  band,  and 
to  automatically  select  the  band  size.  In  addition,  we  are  working  on  implementing  the  method  on  the 
Connection  Machine. 
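
The band-plus-relaxation idea can be sketched as a splitting A = T + R, where T is the extracted band and R is everything else, iterating T x_new = b - R x_old. The sketch below assumes a tridiagonal band taken in the given ordering and uses the serial Thomas algorithm as a stand-in for the parallel band solver; ordering selection and automatic band sizing, which are the open questions above, are not attempted. All names are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system.

    lower[i] = a[i][i-1] (lower[0] unused), upper[i] = a[i][i+1]
    (upper[n-1] unused). Serial stand-in for a parallel band solver.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def band_relaxation(a, b, sweeps):
    """Solve a x = b by solving the tridiagonal part of `a` directly and
    relaxing on the entries outside the band."""
    n = len(b)
    lower = [a[i][i - 1] if i > 0 else 0.0 for i in range(n)]
    diag = [a[i][i] for i in range(n)]
    upper = [a[i][i + 1] if i < n - 1 else 0.0 for i in range(n)]
    x = [0.0] * n
    for _ in range(sweeps):
        # Off-band couplings move to the right-hand side (old iterate).
        rhs = [
            b[i] - sum(a[i][j] * x[j] for j in range(n) if abs(i - j) > 1)
            for i in range(n)
        ]
        x = solve_tridiagonal(lower, diag, upper, rhs)
    return x
```

Only the off-band entries are relaxed, so the more of the strong coupling the band captures, the less the iteration depends on relaxation converging.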

A new technique for simulating switched-capacitor filters and switching power supplies in steady state
was developed*. The idea is based on simulating selected cycles of the high-frequency clock accurately with
a standard discretization method, and pasting together the selected cycles by computing the low-frequency
behavior with a truncated Fourier series. If carefully constructed, the nonlinear system that must be solved for
the Fourier coefficients is almost linear and can be solved rapidly with Newton’s method.

The  WRN  method  with  mesh  refinement  is  more  parallelizable  than  standard  WR  methods,  because 
most  of  the  computation  for  all  the  timesteps  in  a  given  waveform  can  be  computed  in  parallel.  Recent  work  on 
this  method  has  been  to  show  that  the  waveform  Newton  algorithm  converges  globally  when  applied  to  circuits 
with nonlinear capacitors†.

The  model  used  in  conventional  device  simulation  programs  is  based  on  the  drift-diffusion  model  of 
electron  transport,  and  this  model  does  not  accurately  predict  the  field  distribution  near  the  drain  in  small 
geometry  devices.  This  is  of  particular  importance  for  predicting  oxide  breakdown  due  to  penetration  by  “hot” 
electrons.  Two  approaches  for  improving  the  accuracy  of  the  drift-diffusion  model  of  electron  transport  are 
under  investigation.  The  first  approach  is  to  take  an  additional  moment  of  the  Boltzmann  equation,  which 
yields  a  system  of  equations  for  electron  transport  that  is  similar  to  the  drift-diffusion  model,  but  includes  the 
electron  energies.  The  model  is  referred  to  as  the  hydrodynamic  model  and  has  been  implemented  in  several 
simulators. 

However, the discretization techniques used in these simulators create instabilities in certain cases. In
addition,  the  hydrodynamic  model  relies  on  knowing  electron  energy  relaxation  times,  which  are  not  easily 
measured.  For  this  reason,  the  Monte  Carlo  method  is  also  being  investigated.  This  is  a  microscopic  method  in 
which  many  particles  (thousands)  are  tracked  as  they  go  through  random  scatterings.  This  method  is 
computationally  very  expensive,  but  can  model  very  detailed  physics.  It  is,  however,  an  ideal  candidate  for 
parallel  computation  as  the  particles  can  be  tracked  independently,  being  coupled  only  through  the  electric 
field.  Implementation  of  a  Monte  Carlo  simulator  on  the  Connection  Machine  is  currently  under  investigation. 


*K.  Kundert,  J.  White,  A.  Sangiovanni-Vincentelli,  “A  Mixed  Frequency-Time  Approach  for  Finding  the  Steady 
State  Solution  of  Clocked  Analog  Circuits,”  1988  Custom  Integrated  Circuits  Conference.  Rochester, 
NY,  May  16-18, 1988. 

†R. Saleh, J. White, A. Sangiovanni-Vincentelli, and A. R. Newton, “Accelerating Relaxation Algorithms for
Circuit Simulation Using Waveform-Newton and Step-Size Refinement,” submitted to IEEE


PROCESSING  ELEMENTS 


Prof.  Dally  and  Stuart  Fiske  have  been  developing  the  Reconfigurable  Arithmetic  Processor  (RAP), 
which,  as  described  in  our  last  report,  is  an  experiment  in  reducing  the  I/O  bandwidth  required  by  VLSI 
arithmetic  chips.  The  RAP  exploits  the  locality  inherent  in  equations  by  chaining  together  a  number  of  nibble- 
serial  arithmetic  units  using  a  switch  that  is  reconfigured  every  major  cycle.  The  use  of  nibble-serial  arithmetic 
allows us to build a switch in 1/64th the area required by a full parallel switch, and to avoid the pipeline latency
of  parallel  pipelined  arithmetic  units. 
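
Nibble-serial arithmetic handles an operand four bits at a time, passing a carry from one step to the next. The sketch below adds two unsigned 64-bit integers this way; it illustrates the general technique only, not the RAP datapath, and the function and constant names are hypothetical.

```python
NIBBLE_BITS = 4
WORD_BITS = 64


def nibble_serial_add(x, y):
    """Add two unsigned 64-bit integers one 4-bit nibble per step,
    least-significant nibble first, as a nibble-serial unit would."""
    mask = (1 << NIBBLE_BITS) - 1
    carry = 0
    result = 0
    for step in range(WORD_BITS // NIBBLE_BITS):  # 16 steps per word
        shift = step * NIBBLE_BITS
        total = ((x >> shift) & mask) + ((y >> shift) & mask) + carry
        result |= (total & mask) << shift
        carry = total >> NIBBLE_BITS
    return result, carry  # carry is the overflow out of bit 63
```

Because only four bits move per step, each connection through the switch is four wires wide rather than sixty-four, at the cost of sixteen steps per word.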

Stuart  Fiske  has  completed  a  simulation  study  of  six  benchmark  numerical  applications.  The  average 
performance of the RAP on these benchmarks is 9 MFLOPS, 45% of the 20 MFLOPS peak performance.

Our  fixed  point  RAP  prototype  has  been  tested.  The  initial  version  was  missing  a  ground  connection  to 
the output register decoder. This bug prevented us from reading out any output register except R0. Working
around this bug (using R0) we verified proper operation of the input registers, the crossbar switch, the
multipliers,  and  the  adders.  A  revised  chip,  correcting  the  problem,  has  been  fabricated  and  tested. 

Stuart  Fiske  is  currently  designing  a  64-bit  floating-point  RAP  test  chip.  He  has  completed  the  circuit 
design  of  a  64-bit,  nibble-serial,  floating-point  adder/subtracter,  a  major  component  of  the  floating-point  RAP. 

COMMUNICATIONS TOPOLOGY AND ROUTING ALGORITHMS


The  Message  Driven  Processor  (MDP)  project  involves  Prof.  Dally,  Andrew  Chien,  Soha  Hassoun, 
Waldemar  Horwat,  Bill  Song,  Brian  Totty,  and  Scott  Wills.  A  number  of  changes  were  made  to  the  MDP 
architecture  over  the  past  six  months.  Following  a  visit  from  Intel  Engineer  Paul  Carrick,  a  bypass  path  was 
added  to  the  data  path  to  improve  performance.  Also,  the  controller  was  redesigned  and  the  trap  handling  logic 
was  changed  to  more  efficiently  handle  synchronization  using  futures.  A  behavioral  register  transfer  level 
(RTL-B)  simulation  has  been  implemented  to  test  these  changes.  We  have  also  implemented  a  structural  RTL 
simulator  (RTL-S)  to  debug  the  MDP  logic  equations  before  drawing  detailed  schematics  of  the  logic. 

Our  prototype  network  design  frame  chips  have  been  tested.  The  initial  ID  test  chip  functioned 
correctly  except  for  a  subtle  timing  bug  that  occasionally  causes  a  decrementer  malfunction.  Tests  of  the 
prototype  router  have  verified  the  remainder  of  the  router  logic,  the  low-voltage  I/O  pads,  and  the  use  of  a 
single  wire  for  request  and  acknowledge.  A  revised  router  chip  has  been  submitted  for  fabrication. 

The  current  research  by  Prof.  Agarwal  involves  highly  efficient  large-scale  shared-memory 
multiprocessing.  Memory  traffic  over  the  network  is  one  of  the  primary  concerns  in  such  systems.  To  minimize 
the traffic to memory over the communication network, we intend to use large caches. Unlike previous large-scale
efforts (e.g., the IBM RP3), we will cache shared memory blocks in addition to private blocks. Caching not
only  reduces  the  latency  of  memory  accesses  but  also  substantially  increases  the  effective  memory  bandwidth. 
To maintain cache coherency, we intend to use a modified directory scheme. This scheme does not need a
shared  bus  as  the  communication  medium;  shared  buses  are  necessary  in  snoopy-cache  systems  (e.g.,  the 
Berkeley  SPUR)  and  tend  to  limit  the  scalability  of  the  multiprocessor. 
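
The reason a directory needs no shared bus can be seen in a toy full-map directory: the directory records which caches hold each block, so a write sends invalidations point-to-point to exactly those caches instead of broadcasting. The sketch below is a generic illustration, not the modified scheme referred to above, whose details are not given here; the class and method names are hypothetical.

```python
class Directory:
    """Toy full-map directory for one memory module (illustrative only)."""

    def __init__(self):
        self.sharers = {}   # block address -> set of cache ids holding a copy
        self.messages = []  # point-to-point invalidations sent so far

    def read(self, cache_id, block):
        """A cache misses on a read: record it as a sharer of the block."""
        self.sharers.setdefault(block, set()).add(cache_id)

    def write(self, cache_id, block):
        """A cache wants to write: invalidate every other copy, then make
        the writer the only holder."""
        for other in sorted(self.sharers.get(block, set()) - {cache_id}):
            self.messages.append(("invalidate", other, block))
        self.sharers[block] = {cache_id}


directory = Directory()
for cache in (0, 1, 2):
    directory.read(cache, 0x40)
directory.write(1, 0x40)  # caches 0 and 2 each receive one invalidation
```

The number of messages per write grows with the number of sharers of that block rather than with the number of processors in the machine.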

We have already shown that our directory schemes are as efficient as snoopy cache methods for
maintaining  cache  consistency.  In  addition,  directory  schemes  have  the  potential  to  scale  far  beyond  the  limits 
of  snoopy  caches.  Some  early  indication  of  their  scalability  was  provided  by  simulations  using  address  traces  of 
small  multiprocessors  (4  processors).  Our  immediate  goal  is  to  evaluate  the  scalability  of  these  directory 
schemes  for  a  much  larger  number  of  processors. 

SYSTEMS SOFTWARE


The  JOSS  operating  system  for  the  Message-Driven  Processor  (MDP),  described  in  our  last  report,  has 
been  extended  by  adding  new  code  for  heap  compaction,  fast  context  allocation,  late  binding  method  activation, 
object  migration  and  method  caching. 

Prof.  Dally,  Andrew  Chien,  and  Waldemar  Horwat  have  implemented  a  new  Concurrent  Smalltalk 
(CST)  programming  system.  This  system  will  be  used  to  experiment  with  fine-grain  message-passing 
application  programs.  The  initial  implementation,  written  in  Common  Lisp,  consists  of  a  prefix-CST  compiler 
that  generates  message-passing  intermediate  code  (I-Code),  an  I-code  interpreter,  and  an  instrumentation 
package.  The  system  runs  on  Symbolics  36XX  Lisp  Machines  and  Sun  workstations.  Using  this  system,  we 
have  written  and  simulated  many  simple  programs  to  gather  statistics  about  fine-grain  concurrent  programs. 
We  are  currently  working  on  coding  a  few  realistic  benchmark  programs  (e.g.,  a  particle-in-cell  (PIC) 
simulation).  To  determine  the  suitability  of  the  MDP  to  execute  this  class  of  programs,  Waldemar  Horwat  has 
written  a  code  generator  that  converts  I-Codes  to  MDP  assembly  language. 

We  have  identified  several  key  paradigms  for  resource  management  (RM)  in  a  message-passing 
computer  system  that  represent  generalizations  of  RM  schemes  that  have  appeared  in  other  computer  systems. 
We  have  uncovered  three  major  paradigms:  object  replication,  proactive  vs.  reactive  mechanisms,  and 
elimination  of  dynamic  bottlenecks. 

Object  replication  techniques  have  been  used  for  many  years  in  caching  immutable  objects  (instructions 
or  read-only  data).  In  this  case,  replication  is  done  on  the  basis  of  cache  lines.  Cache  coherency  protocols  in 
multiprocessors support replication of mutable objects. In these systems, multiple readers but only one writer are
allowed for a given memory location. Again, the replication is done on the basis of a cache line. We have
generalized the problem and are considering two cases: 1) replication of immutable objects and 2) replication
and consistency control of mutable objects. In an object-oriented model, these decisions are made on a
per-object basis, which is the more natural unit, and they can be made dynamically.

Resource  management  techniques  can  be  thought  of  as  proactive  or  reactive.  Proactive  techniques 
attempt  to  anticipate  future  events  and  provide  for  them.  Reactive  techniques  respond  to  actual  needs. 
Proactive  techniques  are  less  well  understood,  but  promise  significantly  better  performance. 

The  elimination  of  dynamic  bottlenecks  is  another  important  paradigm  in  concurrent  machines.  Such 
bottlenecks  appear  to  occur  often  in  code  loading.  Reduction  operations  (accumulations  or  histograms)  can 
also  give  rise  to  dynamic  bottlenecks.  Several  software  schemes  based  on  combining  have  been  designed  to 
reduce  the  impact  of  such  bottlenecks. 
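
Combining removes the bottleneck of a reduction by merging partial results on the way to the destination. The sketch below is a generic pairwise combining tree, not one of the schemes designed in this project, whose details are not given here; the function name is hypothetical and the input is assumed to be non-empty.

```python
def combining_reduce(values, combine):
    """Reduce `values` in rounds; in each round neighbours combine pairwise,
    so no node receives more than one message per round.

    Returns (result, rounds). A naive reduction would instead send all
    len(values) contributions to a single node.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds
```

With sixteen contributions the sum completes in four rounds, and the busiest node handles one message per round instead of sixteen in total.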

In  all  of  these  paradigms,  optimization  depends  on  information  about  a  program’s  run  time  behavior. 
Some  of  this  information  may  be  gleaned  by  sophisticated  compilation  techniques.  However,  we  feel  that  there 
is  often  a  need  for  the  programmer  to  make  some  assertions  about  the  program. 

ALGORITHMS


Mark  Newman,  Johan  Hastad  and  Prof.  Leighton  continued  their  work  on  techniques  for  using  a 
hypercube  even  when  a  significant  portion  of  its  nodes  and  edges  are  faulty.  Previously,  a  fast,  local  and 
deterministic  algorithm  was  shown  to  reconfigure  the  working  nodes  of  a  hypercube  into  a  fully  functioning 
hypercube  of  only  one  lower  dimension  even  when  up  to  half  of  the  nodes  of  the  original  hypercube  contain 
random  faults.  These  reconfigurations  enabled  the  hypercube’s  working  nodes  to  simulate  a  wide  variety  of 
constant  degree  networks  with  only  constant  slowdown,  again  with  the  failure  of  up  to  half  of  the  nodes.  For 
example,  even  if  half  of  the  nodes  of  the  hypercube  have  random  malfunctions,  the  hypercube  can  still  simulate 
every  binary  tree  of  the  same  size  with  constant  slowdown. 

Recently,  these  results  were  extended  to  include  wire  faults,  as  well  as  scenarios  where  far  less  than  half 
of  the  hardware  malfunctions.  Progress  was  also  made  on  the  problem  of  reducing  the  congestion  encountered 
during  the  simulation  of  a  functioning  hypercube  by  a  faulty  hypercube.  It  is  hoped  that  the  new  work  will  lead 
to  a  simulation  algorithm  that  has  constant  congestion  independent  of  the  size  of  the  hypercube. 

Eric  Schwabe  and  Prof.  Leighton  are  continuing  their  work  on  space-efficient  techniques  for  queue 
management  in  very  large  scale  networks.  One  of  the  main  difficulties  in  designing  algorithms  for  current  large 
scale  parallel  machines  is  making  sure  that  the  capacities  of  the  local  memories  are  not  exceeded.  Schwabe  and 
Leighton  have  made  substantial  progress  towards  solving  this  problem  by  designing  a  general  scheme  for 
dynamically  reorganizing  memory  so  that  local  memory  constraints  are  never  exceeded  provided  that  global 
memory constraints are not exceeded. The latter constraint is, of course, much easier to ensure for large scale
parallel machines where a very large amount of memory is distributed in very small chunks among a large
number  of  processors.  The  new  scheme  is  simple,  real-time,  space-efficient,  deterministic  and  transparent  to 
the  programmer.  It  requires  only  that  the  total  hardware  of  the  system  (i.e.,  wires  and  total  memory)  exceed 
the  number  (not  size)  of  the  local  memories  being  managed  by  a  logarithmic  factor.  In  return,  the  scheme 
guarantees  an  arbitrarily  high  percentage  utilization  of  the  total  memory.  The  scheme  is  specifically  designed 
for  use  with  hypercube-related  networks,  and  works  well  in  both  worst-case  and  average-case  settings.  Results 
of this work are reported in *.

Working with coauthors from Yale, Bellcore, and UMass-Amherst, Prof. Leighton is continuing his
research  in  network  simulations.  The  speed  with  which  one  network  can  simulate  another  is  one  good 
indication  of  the  relative  computing  power  of  the  networks.  For  example,  the  hypercube  can  simulate  a  mesh  of 
the  same  size  without  delay,  but  not  vice-versa.  Recently,  Prof.  Leighton  and  his  coauthors  showed  that  the 
hypercube can simulate just about every other network proposed for parallel computation with only constant
slowdown.  Included  are:  all  binary  trees,  the  X-tree,  the  pyramid  networks,  all  grid  networks,  meshes  of  trees, 
cube-connected-cycles,  and  the  butterfly.  There  remained  the  question  of  whether  or  not  other  networks  such 
as  the  butterfly  had  the  same  property.  Recently,  Prof.  Leighton  and  coauthors  showed  that  this  was  not  the 
case  by  developing  general  lower  bound  techniques  for  showing  that  one  network  cannot  simulate  another  very 
efficiently.  For  example,  they  proved  that  networks  like  the  butterfly  and  shuffle-exchange  graph  cannot 
simulate  networks  like  the  X-tree,  pyramid  graph  or  mesh  without  a  log  N  factor  slowdown,  the  worst 
possible. Results of this work can be found in †.
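
The statement that a hypercube simulates a mesh of the same size without delay rests on an embedding in which mesh neighbors map to hypercube neighbors. The standard construction, sketched below for a two-dimensional mesh with power-of-two sides, labels each coordinate with a reflected Gray code. It is a textbook technique rather than a result of this project, and the function names are hypothetical.

```python
def gray(i):
    """Reflected binary Gray code of i: consecutive values differ in one bit."""
    return i ^ (i >> 1)


def mesh_to_hypercube(row, col, col_bits):
    """Hypercube node label for mesh cell (row, col): the Gray codes of the
    two coordinates are concatenated."""
    return (gray(row) << col_bits) | gray(col)


def hamming(a, b):
    """Number of bit positions in which two node labels differ."""
    return bin(a ^ b).count("1")
```

For an 8 x 4 mesh (`col_bits = 2`) the 32 cells receive distinct 5-bit labels, and every pair of mesh neighbors differs in exactly one bit, so each mesh edge is a single hypercube edge.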


*T. Leighton and P. Shor, “Tight Bounds for Minimax Grid Matching, with Applications to the Average Case
Analysis of Algorithms,” Combinatorica, to appear, 1987.

†S. Bhatt, F. Chung, J. Hong, T. Leighton and A. Rosenberg, “Optimal Simulations by Butterfly Networks,”
1988 ACM Symposium on Theory of Computing, to appear.

Working under the direction of Prof. Leighton, Seth Mallitz recently resolved an open question
concerning the pagenumber of graphs with E edges by showing that every E-edge graph can be embedded in
a book with O(E^{1/2}) pages and that this is close to optimal. Recently, this work has been extended further to
show that the pagenumber of any graph with genus G is at most O(G^{1/2}), again the best possible, in general.
Determining the pagenumber of a graph has relevance to problems in wire routing and fault tolerance, and has
been discussed in previous research reviews. The current work* substantially improves the best known
previous bounds from above and below.

Working  with  Profs.  Finkelstein  (Northeastern)  and  Kleitman,  Prof.  Leighton  discovered  a  method  for 
efficiently  realizing  permutations  of  data  stored  in  VLSI  chips  using  bus  interconnections  with  a  small  number 
of  pins  per  chip.  The  work  resolves  the  question  left  open  in  the  recent  work  on  the  same  subject  by  Kilian, 
Kipnis  and  Leiserson,  and  provides  tight  bounds  on  the  number  of  pins  required  to  realize  permutations  with 
uniform permutation architectures. Results of this research are reported in †.

Prof.  Leiserson  and  Cindy  Phillips  have  extended  their  previous  results  for  parallel  graph  contraction. 
The new results allow any bounded-degree n-vertex graph to be contracted in time O(lg n + lg^2 g) on
an EREW PRAM where g is the largest genus of any connected component of the graph. The algorithm does
not require knowledge of the genus. Cindy Phillips, Guy Blelloch, and Ajit Agrawal and Robert Krawitz of
Thinking Machines Corporation have implemented on the Connection Machine a set of techniques for
manipulating large dense matrices on hypercube multiprocessors. In particular, they consider the case where
there are fewer processors than matrix elements. A set of powerful primitives hides the problem embedding
from  the  user  while  allowing  efficient  manipulation  of  the  embedding.  Problems  such  as  dense  linear  systems 
solving  have  been  sped  up  by  an  order  of  magnitude. 

Prof.  Leiserson  and  Bruce  Maggs  have  discovered  a  class  of  point-to-point  networks  that  are  area 
universal in the sense that, with high probability, a network in the class with N processors has area O(N) and
can simulate in O(log N) steps each message-step of any shared-bus network of area O(N). The
simulation is optimal because a point-to-point network may require Ω(log N) steps to simulate one step of a
shared-bus  network.  The  area  universal  networks  are  based  on  the  fat-trees  of  Greenberg  and  Leiserson  and 
the  simulation  uses  a  randomized  message  routing  algorithm  based  on  the  butterfly  routing  algorithm  of 
Ranade. 

They have also been studying a model for parallel computation called the concurrent-read concurrent-write
distributed random-access machine (CRCW DRAM). In a DRAM, the communication requirements of
a  parallel  algorithm  can  be  evaluated.  A  conservative  DRAM  algorithm  is  one  whose  communication 
requirements  at  each  step  can  be  bounded  by  the  congestion  of  pointers  of  the  input  data  structure  across  cuts 
of  the  network.  Previously,  they  gave  a  lemma  that  showed  how  to  “shortcut”  pointers  in  a  data  structure  in  an 
exclusive-read  exclusive-write  (EREW)  DRAM  so  that  remote  processors  could  communicate  without  causing 
undue  congestion.  Recently  they  discovered  a  more  powerful  shortcut  lemma  for  the  CRCW  DRAM  model. 
Using  this  lemma  one  can  construct  conservative  CRCW  DRAM  algorithms  for  problems  such  as  finding  the 
minimum cost spanning tree of a graph that, with high probability, require O(log N) steps. The fastest known
conservative EREW DRAM algorithms for these problems require O(log^2 N) steps.
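
In its simplest form, shortcutting pointers is the classic pointer-jumping step, in which every node replaces its pointer with its successor's pointer so that distances halve each round. The sketch below ranks a linked list this way. It shows only the basic operation and does not model congestion across cuts of the network, which is what the shortcut lemmas above control; the function name is hypothetical.

```python
def list_rank(successor):
    """Distance from each node to the end of a linked list by pointer jumping.

    successor[i] is the node after i, or i itself at the tail.
    Each round is one synchronous parallel step: all nodes read the old
    arrays, then all write the new ones.
    Returns (rank, rounds).
    """
    n = len(successor)
    nxt = list(successor)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    rounds = 0
    # Continue while some node does not yet point directly at the tail.
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
        rounds += 1
    return rank, rounds
```

An eight-node list is ranked in three rounds, in line with the logarithmic step counts quoted above for the more general problems.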

Work  is  also  in  progress  on  several  other  problems,  although  definitive  results  have  not  yet  been 
achieved.  For  example,  Prof.  Leighton  is  working  with  Richard  Koch  on  the  asymptotic  analysis  of  butterfly 
routing algorithms like those used by the BBN Butterfly; Prof. Leighton is working with Bruce Maggs and


*S. Mallitz, “E-edge Graphs Have Pagenumber O(E^{1/2}),” to be submitted, March 1988.

†L. Finkelstein, D. Kleitman and T. Leighton, “Applying the Classification Theorem for Finite Simple Groups
to Minimize Pin Count in Uniform Permutation Architectures,” 1988 Aegean Workshop on Computing,
to appear.

Satish Rao on a unified approach to packet routing in arbitrary networks; and Prof. Leighton is working with
Satish Rao on the development of a general theory for approximate max-flow min-cut results for
multicommodity flows, which will have applications to a variety of problems including VLSI layout, graph
bisection,  crossing  number  and  packet  routing. 

APPLICATIONS


Prof.  Wyatt  is  beginning  a  new  project  on  the  parallel  simulation  of  large,  regular  analog  arrays.  The 
goal  is  to  produce  a  simulation  tool  that  can  be  used  for  the  design  of  smart  sensors  for  machine  vision,  such  as 
those  currently  being  developed  at  Carver  Mead’s  laboratory  at  Caltech.  These  chips  typically  consist  of  large 
arrays,  e.g.,  from  32x32  to  128x128,  of  moderately  simple  analog  cells  that  perform  a  collective  analog 
computation  by  communication  with  nearest  neighbors.  No  currently  available  simulation  tool  is  of  any  use  on 
circuits  of  this  size. 

Systems  of  this  type  have  two  features  that  should  influence  the  design  of  a  tailor-made  simulator.  One 
is  the  natural  hierarchy,  devices  in  circuits  in  cells  in  arrays.  The  other  is  the  designer’s  concern  with  circuit 
sensitivity  in  such  systems:  how  do  individual  component  variations  affect  overall  system  performance?  We  are 
studying  ways  to  adapt  the  classical  adjoint  network  approach  to  a  hierarchical  framework  to  solve  this  latter 
problem  with  acceptable  computational  efficiency. 

Prof.  Leiserson  and  Jeff  Siskind  have  been  investigating  efficiency  models  for  Prolog.  This  past  summer, 
an implementation of Prolog that performed dependency-directed backtracking was described. The
methodology  used  for  that  implementation  was  to  incrementally  unravel  a  Prolog  program  into  an  AND/OR 
goal  tree  and  represent  both  the  search  tree,  as  well  as  the  unification  bindings  at  its  leaf  nodes,  as  a  SAT 
problem  in  a  TMS.  This  approach  required  precomputing  ALL  potential  nogoods  corresponding  to  potential 
unification  violations  at  every  unraveling  step,  clearly  a  fundamental  inefficiency  flaw. 

An  alternative  methodology  is  being  pursued  in  a  new  implementation.  The  implementation  starts  off 
with  a  Prolog  interpreter  whose  structure  is  similar  to  conventional  implementations  and  whose  performance 
differs  from  such  implementations  by  only  a  (presumably  small)  constant  factor.  This  interpreter  is 
incrementally modified to add mechanisms for doing Selective Backtracking, Lateral Pruning (Caching
Nogoods), Non-chronological Retraction, and Constraint Propagation. As each mechanism is added, care is
taken to minimize, or at least understand, its effect on efficiency of the baseline system. Each incremental
change  can  be  independently  enabled  or  disabled  to  allow  experimental  analysis  of  the  benefits  accrued  by  each 
of  the  above  techniques,  in  isolation  and  in  combination  with  the  others. 
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A  METHOD  FOR  THE  DESIGN  OF  STABLE  LATERAL  INHIBITION 
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Department  of  Electrical  Engineering  and  Computer  Science 
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ABSTRACT 

In  the  analog  VLSI  implementation  of  neural  systems,  it  is 
sometimes  convenient  to  build  lateral  inhibition  networks  by  using 
a  locally  connected  on-chip  resistive  grid.  A  serious  problem 
of  unwanted  spontaneous  oscillation  often  arises  with  these 
circuits  and  renders  them  unusable  in  practice.  This  paper  reports 
a  design  approach  that  guarantees  such  a  system  will  be  stable, 
even  though  the  values  of  designed  elements  and  parasitic  elements 
in  the  resistive  grid  may  be  unknown.  The  method  is  based  on  a 
rigorous,  somewhat  novel  mathematical  analysis  using  Tellegen's 
theorem  and  the  idea  of  Popov  multipliers  from  control  theory.  It 
is  thoroughly  practical  because  the  criteria  are  local  in  the  sense 
that  no  overall  analysis  of  the  interconnected  system  is  required, 
empirical  in  the  sense  that  they  involve  only  measurable  frequency 
response  data  on  the  individual  cells,  and  robust  in  the  sense  that 
unmodelled  parasitic  resistances  and  capacitances  in  the  inter¬ 
connection  network  cannot  affect  the  analysis. 

I. INTRODUCTION

The  term  "lateral  inhibition"  first  arose  in  neurophysiology  to 
describe  a  common  form  of  neural  circuitry  in  which  the  output  of 
each  neuron  in  some  population  is  used  to  inhibit  the  response  of 
each  of  its  neighbors.  Perhaps  the  best  understood  example  is  the 
horizontal  cell  layer  in  the  vertebrate  retina,  in  which  lateral 
inhibition  simultaneously  enhances  intensity  edges  and  acts  as  an 
automatic  gain  control  to  extend  the  dynamic  range  of  the  retina 
as a whole. The principle has been used in the design of artificial
neural system algorithms by Kohonen2 and others and in the electronic
design of neural chips by Carver Mead et al.3,4

In  the  VLSI  implementation  of  neural  systems,  it  is  convenient 
to  build  lateral  inhibition  networks  by  using  a  locally  connected 
on-chip  resistive  grid.  Linear  resistors  fabricated  in,  e.g., 
polysilicon,  yield  a  very  compact  realization,  and  nonlinear 
resistive  grids,  made  from  MOS  transistors,  have  been  found  useful 
for image segmentation.4,5 Networks of this type can be divided into
two classes: feedback systems and feedforward-only systems. In the
feedforward case one set of amplifiers imposes signal voltages or
currents on the grid and another set reads out the resulting response
for subsequent processing, while the same amplifiers both "write" to
the grid and "read" from it in a feedback arrangement. Feedforward
networks  of  this  type  are  inherently  stable,  but  feedback  networks 
need  not  be. 

A  practical  example  is  one  of  Carver  Mead's  retina  chips3  that 
achieves  edge  enhancement  by  means  of  lateral  inhibition  through  a 
resistive  grid.  Figure  1  shows  a  single  cell  in  a  continuous-time 
version  of  this  chip.  Note  that  the  capacitor  voltage  is  affected 
both  by  the  local  light  intensity  incident  on  that  cell  and  by  the 
capacitor  voltages  on  neighboring  cells  of  identical  design.  Any 
cell  drives  its  neighbors,  which  drive  both  their  distant  neighbors 
and  the  original  cell  in  turn.  Thus  the  necessary  ingredients  for 
instability — active  elements  and  signal  feedback — are  both  present 
in  this  system,  and  in  fact  the  continuous-time  version  oscillates 
so  badly  that  the  original  design  is  scarcely  usable  in  practice 
with the lateral inhibition paths enabled.^* Such oscillations can
readily occur in any resistive grid circuit with active elements and
feedback, even when each individual cell is quite stable. Analysis
of the conditions of instability by straightforward methods appears
hopeless, since any repeated array contains many cells, each of
which influences many others directly or indirectly and is influenced
by them in turn, so that the number of simultaneously active feedback
loops is enormous.

Figure 1. This photoreceptor and signal processor circuit, using two
MOS transconductance amplifiers, realizes lateral inhibition by
communicating with similar units through a resistive grid. (Figure
labels: "incident light", "to similar cells".)

This  paper  reports  a  practical  design  approach  that  rigorously 
guarantees  such  a  system  will  be  stable.  The  very  simplest  version 
of  the  idea  is  intuitively  obvious:  design  each  individual  cell  so 
that,  although  internally  active,  it  acts  like  a  passive  system  as 
seen  from  the  resistive  grid.  In  circuit  theory  language,  the 
design  goal  here  is  that  each  cell's  output  impedance  should  be  a 
positive-real function. This is sometimes not too difficult in
practice;  we  will  show  that  the  original  network  in  Fig.  1  satisfies 
this  condition  in  the  absence  of  certain  parasitic  elements.  More 
important, perhaps, it is a condition one can verify experimentally
by frequency-response measurements.

It  is  physically  apparent  that  a  collection  of  cells  that 
appear  passive  at  their  terminals  will  form  a  stable  system  when 
interconnected  through  a  passive  medium  such  as  a  resistive  grid. 

The  research  contributions,  reported  here  in  summary  form,  are 
i)  a  demonstration  that  this  passivity  or  positive-real  condition 
is  much  stronger  than  we  actually  need  and  that  weaker  conditions, 
more  easily  achieved  in  practice,  suffice  to  guarantee  stability  of 
the  linear  network  model,  and  ii)  an  extension  of  i)  to  the  nonlinear 
domain  that  furthermore  rules  out  large-signal  oscillations  under 
certain  conditions. 

II.  FIRST-ORDER  LINEAR  ANALYSIS  OF  A  SINGLE  CELL 


We  begin  with  a  linear  analysis  of  an  elementary  model  for  the 
circuit  in  Fig.  1.  For  an  initial  approximation  to  the  output 
admittance  of  the  cell  we  simplify  the  topology  (without  loss  of 
relevant information) and use a naive model for the transconductance
amplifiers,  as  shown  in  Fig.  2. 


Figure 2. Simplified network topology and transconductance amplifier
model for the circuit in Fig. 1. The capacitor in Fig. 1 has been
absorbed into C_o2.

Straightforward calculations show that the output admittance is
given by

$$Y(s) = \left[ g_{m2} + R_{o2}^{-1} + s\,C_{o2} \right] + \frac{g_{m1}\, g_{m2}\, R_{o1}}{1 + s\, R_{o1} C_{o1}} . \qquad (1)$$

This is a positive-real, i.e., passive, admittance since it can always
be realized by a network of the form shown in Fig. 3, where

$$R_1 = \left( g_{m2} + R_{o2}^{-1} \right)^{-1}, \qquad R_2 = \left( g_{m1}\, g_{m2}\, R_{o1} \right)^{-1}, \qquad \text{and} \qquad L = \frac{C_{o1}}{g_{m1}\, g_{m2}} .$$
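As a quick numerical sanity check of the statement above, the sketch below evaluates the first-order cell admittance on the imaginary axis and compares it with the admittance of a passive network built from R1, R2, L and the output capacitance. The element values are hypothetical (they are not taken from the paper), and the Fig. 3 topology is assumed here to be R1 in parallel with C_o2 in parallel with a series R2-L branch, which is the arrangement the three formulas imply.

```python
import numpy as np

# Hypothetical element values (assumptions for illustration only), SI units.
gm1, gm2 = 1e-5, 1e-5       # transconductances [S]
Ro1, Ro2 = 1e8, 1e8         # amplifier output resistances [ohm]
Co1, Co2 = 1e-12, 1e-12     # amplifier output capacitances [F]

def y_cell(s):
    """Output admittance of the simplified cell model, eq. (1)."""
    return gm2 + 1.0 / Ro2 + s * Co2 + gm1 * gm2 * Ro1 / (1.0 + s * Ro1 * Co1)

# Element values of the assumed passive realization.
R1 = 1.0 / (gm2 + 1.0 / Ro2)
R2 = 1.0 / (gm1 * gm2 * Ro1)
L = Co1 / (gm1 * gm2)

def y_passive(s):
    """R1 || Co2 || (R2 in series with L)."""
    return 1.0 / R1 + s * Co2 + 1.0 / (R2 + s * L)

omega = np.logspace(0, 9, 2000)        # rad/s
s = 1j * omega
same = np.allclose(y_cell(s), y_passive(s))
min_real = y_cell(s).real.min()
print(same, min_real)
```

With positive element values the real part of Y(jw) is a sum of positive terms, so the check succeeds for any such choice; the interesting failures arise only once parasitics outside this model are included.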

Although  the  original  circuit  contains  no  inductors,  the 
realization  has  both  capacitors  and  inductors  and  thus  is  capable 
of  damped  oscillations.  Nonetheless,  if  the  transamp  model  in 
Fig.  2  were  perfectly  accurate,  no  network  created  by  interconnecting 
such  cells  through  a  resistive  grid  (with  parasitic  capacitances) 
could  exhibit  sustained  oscillations.  For  element  values  that  may 
be  typical  in  practice,  the  model  in  Fig.  3  has  a  lightly  damped 
resonance around 1 kHz with Q ≈ 10. This disturbingly high Q
suggests that the cell will be highly sensitive to parasitic elements
not captured by the simple models in Fig. 2. Our preliminary
analysis of a much more complex model extracted from a physical
circuit layout created in Carver Mead's laboratory indicates that
the output impedance will not be passive for all values of the transamp
bias currents. But a definite explanation of the instability
awaits a more careful circuit modelling effort and perhaps the design
of an on-chip impedance measuring instrument.

Figure 3. Passive network realization of the output admittance (eq.
(1)) of the circuit in Fig. 2.

III. POSITIVE-REAL FUNCTIONS, θ-POSITIVE FUNCTIONS, AND
STABILITY OF LINEAR NETWORK MODELS

In the following discussion s = σ + jω is a complex variable,
H(s) is a rational function (ratio of polynomials) in s with real
coefficients, and we assume for simplicity that H(s) has no pure
imaginary poles. The term closed right half plane refers to the set
of complex numbers s with Re{s} ≥ 0.

Def. 1

The function H(s) is said to be positive-real if a) it has no
poles in the right half plane and b) Re{H(jω)} ≥ 0 for all ω.

If we know at the outset that H(s) has no right half plane poles,
then Def. 1 reduces to a simple graphical criterion: H(s) is positive-real
if and only if the Nyquist diagram of H(s) (i.e. the plot of
H(jω) for ω ≥ 0, as in Fig. 4) lies entirely in the closed right half
plane.

Note  that  positive-real  functions  are  necessarily  stable  since 
they  have  no  right  half  plane  poles,  but  stable  functions  are  not 
necessarily  positive-real,  as  Example  1  will  show. 

A  deep  link  between  positive  real  functions,  physical  networks 
and  passivity  is  established  by  the  classical  result7  in  linear 
circuit  theory  which  states  that  H(s)  is  positive-real  if  and  only  if 
it  is  possible  to  synthesize  a  2-terminal  network  of  positive  linear 
resistors,  capacitors,  inductors  and  ideal  transformers  that  has  H(s) 
as  its  driving-point  impedance  or  admittance. 


Def. 2

The function H(s) is said to be θ-positive for a particular value
of θ (θ ≠ 0, θ ≠ π), if a) H(s) has no poles in the right half plane,
and b) the Nyquist plot of H(s) lies strictly to the right of the
straight line passing through the origin at an angle θ to the real
positive axis.

Note that every θ-positive function is stable and any function
that is θ-positive with θ = π/2 is necessarily positive-real.


Figure 4. Nyquist diagram for a function that is θ-positive but
not positive-real.

Example  1 

The function

$$G(s) = \frac{(s+1)(s+40)}{(s+5)(s+6)(s+7)} \qquad (2)$$

is θ-positive (for any θ between about 18° and 68°) and stable, but it
is not positive-real since its Nyquist diagram, shown in Fig. 4,
crosses into the left half plane.
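The range of admissible angles quoted in Example 1 can be checked numerically. The sketch below samples the function of eq. (2) along the positive imaginary axis; the frequency grid and the variable names are illustrative choices. The Nyquist plot lies strictly to the right of a line at angle θ exactly when θ - 180° < phase < θ at every frequency, so the admissible θ lie between the largest phase and the smallest phase plus 180°.

```python
import numpy as np

def G(s):
    """Example 1 transfer function, eq. (2)."""
    return (s + 1) * (s + 40) / ((s + 5) * (s + 6) * (s + 7))

omega = np.logspace(-3, 4, 20001)            # rad/s, assumed sampling grid
H = G(1j * omega)
phase = np.degrees(np.unwrap(np.angle(H)))   # continuous phase in degrees

theta_low = phase.max()            # theta must exceed the largest phase
theta_high = phase.min() + 180.0   # and stay below (smallest phase + 180)
min_real = H.real.min()            # negative => not positive-real

print(f"theta-positive for {theta_low:.1f} deg < theta < {theta_high:.1f} deg")
print(f"min Re G(jw) = {min_real:.4f}")
```

The same few lines apply to measured frequency-response data: replace `H` with the measured complex response of a cell.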

The importance of θ-positive functions lies in the following
observations: a) an interconnection of passive linear resistors and
capacitors and cells with stable linear impedances can result in an
unstable network, b) such an instability cannot result if the
impedances are also positive-real, c) θ-positive impedances form a
larger class than positive-real ones and hence θ-positivity is a less
demanding synthesis goal, and d) Theorem 1 below shows that such an
instability cannot result if the impedances are θ-positive, even if
they are not positive-real.

Theorem  1 


Consider a linear network of arbitrary topology, consisting of
any number of passive 2-terminal resistors and capacitors of arbitrary
value driven by any number of active cells. If the output impedances
of all the active cells are θ-positive for some common θ, 0 < θ ≤ π/2,
then the network is stable.

The proof of Theorem 1 relies on Lemma 1 below.

Lemma 1

If H(s) is θ-positive for some fixed θ, then for all s_0 in the
closed first quadrant of the complex plane, H(s_0) lies strictly to
the right of the straight line passing through the origin at an angle
θ to the real positive axis, i.e., Re{s_0} ≥ 0 and Im{s_0} ≥ 0 imply

$$\theta - \pi < \angle H(s_0) < \theta .$$

Proof  of  Lemma  1  (Outline) 


Let  d  be  the  function  that  assigns  to  each  s  in  the  closed  right 
half  plane  the  perpendicular  distance  d(s)  from  H(s)  to  the  line 
defined  in  Def.  2.  Note  that  d(s)  is  harmonic  in  the  closed  right 
half  plane,  since  H  is  analytic  there.  It  then  follows,  by  application 
of  the  maximum  modulus  principle8  for  harmonic  functions,  that  d  takes 
its  minimum  value  on  the  boundary  of  its  domain,  which  is  the 
imaginary  axis.  This  establishes  Lemma  1. 

Proof  of  Theorem  1  (Outline) 


The network is unstable or marginally stable if and only if it
has a natural frequency in the closed right half plane, and s_0 is a
natural frequency if and only if the network equations have a nonzero
solution at s_0. Let {I_k} denote the complex branch currents of such
a solution. By Tellegen's theorem9 the sum of the complex powers
absorbed by the circuit elements must vanish at such a solution, i.e.,

$$\sum_{\text{resistances}} |I_k|^2 R_k \;+ \sum_{\text{capacitances}} \frac{|I_k|^2}{s_0 C_k} \;+ \sum_{\text{cell terminal pairs}} |I_k|^2 Z_k(s_0) \;=\; 0, \qquad (3)$$

where the second term is deleted in the special case s_0 = 0, since the
complex power into capacitors vanishes at s_0 = 0.

If the network has a natural frequency in the closed right half
plane, it must have one in the closed first quadrant since natural
frequencies are either real or else occur in complex conjugate pairs.

But (3) cannot be satisfied for any s_0 in the closed first quadrant,
as we can see by dividing both sides of (3) by Σ_k |I_k|^2, where the
sum is taken over all network branches. After this division, (3)
asserts that zero is a convex combination of terms of the form R_k,
terms of the form (C_k s_0)^{-1}, and terms of the form Z_k(s_0).

Visualize where these terms lie in the complex plane: the first set lies
on the real positive axis, the second set lies in the closed 4th
quadrant since s_0 lies in the closed 1st quadrant by assumption, and
the third set lies to the right of a line passing through the origin
at an angle θ by Lemma 1. Thus all these terms lie strictly to the
right of this line, which implies that no convex combination of them
can equal zero. Hence the network is stable!
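Theorem 1 can be illustrated on a small example. The sketch below is an assumed test case, not a circuit from the paper: a ring of identical cells whose output impedance is the θ-positive function G(s) of Example 1, each shunted by a capacitor to ground and coupled to its two neighbours through resistors. The element values are normalized and hypothetical. For a ring, the node equations decouple along the eigenvectors of the graph Laplacian, so each mode satisfies 1/Z(s) + sC + λ/R = 0, and the natural frequencies are polynomial roots.

```python
import numpy as np

NUM = np.array([1.0, 41.0, 40.0])            # (s+1)(s+40)
DEN = np.array([1.0, 18.0, 107.0, 210.0])    # (s+5)(s+6)(s+7)

def ring_natural_frequencies(n_cells, r_link, c_node):
    """Natural frequencies of a ring of cells with impedance NUM/DEN.

    n_cells, r_link and c_node are illustrative parameters (normalized units).
    """
    # Laplacian eigenvalues of an n-node cycle graph with unit edge weights.
    lam = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(n_cells) / n_cells)
    freqs = []
    for g in lam / r_link:
        # 1/Z(s) + s*C + g = 0   <=>   DEN(s) + (s*C + g) * NUM(s) = 0
        char_poly = np.polyadd(DEN, np.polymul([c_node, g], NUM))
        freqs.extend(np.roots(char_poly))
    return np.array(freqs)

freqs = ring_natural_frequencies(n_cells=16, r_link=0.5, c_node=0.2)
print("largest real part:", freqs.real.max())
```

Every natural frequency lies in the open left half plane, as the theorem predicts, even though the cell impedance is not positive-real.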


IV.  STABILITY  RESULT  FOR  NETWORKS  WITH  NONLINEAR 
RESISTORS  AND  CAPACITORS 

The  previous  result  for  linear  networks  can  afford  some  limited 
insight  into  the  behavior  of  nonlinear  networks.  First  the  nonlinear 
equations  are  linearized  about  an  equilibrium  point  and  Theorem  1  is 
applied  to  the  linear  model.  If  the  linearized  model  is  stable,  then 
the  equilibrium  point  of  the  original  nonlinear  network  is  locally 
stable,  i.e.,  the  network  will  return  to  that  equilibrium  point  if 
the  initial  condition  is  sufficiently  near  it.  But  the  result  in  this 
section,  in  contrast,  applies  to  the  full  nonlinear  circuit  model  and 
allows  one  to  conclude  that  in  certain  circumstances  the  network 
cannot  oscillate  even  if  the  initial  state  is  arbitrarily  far  from 
the  equilibrium  point. 

Def. 3

A function H(s) as described in Section III is said to satisfy
the Popov criterion10 if there exists a real number r ≥ 0 such that
Re{(1 + jωr) H(jω)} ≥ 0 for all ω.

Note that positive real functions satisfy the Popov criterion
with r = 0. And the reader can easily verify that G(s) in Example 1
satisfies the Popov criterion for a range of values of r. The important
effect of the term (1 + jωr) in Def. 3 is to rotate the Nyquist plot
counterclockwise by progressively greater amounts up to 90° as ω
increases.
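The verification left to the reader can be done numerically along the following lines. This is a sampled check on a finite frequency grid rather than a proof, and the grid and the trial value r = 0.1 are assumptions chosen for illustration.

```python
import numpy as np

def G(s):
    """Example 1 transfer function, eq. (2)."""
    return (s + 1) * (s + 40) / ((s + 5) * (s + 6) * (s + 7))

def popov_margin(r, omega):
    """Smallest sampled value of Re{(1 + j*omega*r) * G(j*omega)}."""
    return np.real((1 + 1j * omega * r) * G(1j * omega)).min()

# Assumed grid: omega = 0 plus a logarithmic sweep.
omega = np.concatenate(([0.0], np.logspace(-3, 5, 20001)))

margin_r0 = popov_margin(0.0, omega)     # r = 0 is the positive-real test
margin_r01 = popov_margin(0.1, omega)    # trial multiplier r = 0.1

print("r = 0.0:", margin_r0)    # negative: G is not positive-real
print("r = 0.1:", margin_r01)   # positive on the sampled grid
```

Sweeping r over a range with the same function shows which multipliers work; for this G(s) both very small and very large values of r fail.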

Theorem  2 

Consider a network consisting of nonlinear 2-terminal resistors
and capacitors, and cells with linear output impedances Z_k(s). Suppose

i) the resistor curves are characterized by continuously
differentiable functions i_k = g_k(v_k) where g_k(0) = 0 and
0 < g_k'(v_k) < G < ∞ for all values of k and v_k,

ii) the capacitors are characterized by i_k = C_k(v_k) dv_k/dt with
0 < C_1 < C_k(v_k) < C_2 < ∞ for all values of k and v_k,

iii) the impedances Z_k(s) have no poles in the closed right
half plane and all satisfy the Popov criterion for some common
value of r.

If these conditions are satisfied, then the network is stable in the
sense that, for any initial condition,

$$\int_0^{\infty} \Big[ \sum_{\text{all branches}} i_k^2(t) \Big] \, dt < \infty . \qquad (4)$$

The  proof,  based  on  Tellegen’s  theorem,  is  rather  involved.  It 
will  be  omitted  here  and  will  appear  elsewhere. 
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Preface 


This  manuscript  consists  of  the  lecture  notes  for  6.848/18.435  Theory  of  Parallel  and 
VLSI  Computation  as  it  was  taught  by  Professors  Leighton  and  Leiserson  in  the  Fall  of 
1987. The notes were written by students in the class and were reviewed and organized
by Bruce Maggs, Serge Plotkin, and Joel Wein, who served as Teaching Assistants for
the  class.  The  notes  are  not  meant  to  be  in  polished  form,  and  they  probably  contain 
several  errors  and  omissions,  particularly  with  respect  to  references  in  the  literature. 
Rather,  they  are  intended  to  be  an  aid  in  teaching  and/or  studying  the  introductory 
work  in  networks,  parallel  computation,  and  VLSI  Design.  For  a  more  complete  and 
formal  treatment  of  the  work  in  this  area,  we  refer  the  reader  to  the  references  listed  in 
the  reading  assignments  and/or  Professor  Leighton’s  forthcoming  book  on  this  subject. 

The  class  was  attended  by  about  35  students,  25  of  whom  took  the  course  for  credit. 
No  previous  familiarity  with  networks,  parallel  computation,  or  VLSI  was  assumed,  but 
we  did  require  that  students  have  gained  some  familiarity  with  data  structures  and 
algorithms  before  taking  the  course.  The  lectures  were  supplemented  with  biweekly 
problem  sets  and  a  list  of  topics  for  research. 

It  is  our  intention  to  update  these  notes  each  year  that  the  course  is  taught,  and  we 
would  appreciate  knowing  of  any  errors  and  omissions  in  the  current  notes  so  that  they 
may  be  corrected  in  future  notes.  Currently,  we  plan  to  teach  the  course  during  each  fall 
term. 

Notes  for  the  advanced  course  on  this  subject  ( 6.849/18.436  Advanced  Parallel  and 
VLSI  Computation )  are  also  available  as  an  MIT-LCS  Research  Seminar  Series  RSS-2. 
The  advanced  course  was  last  taught  in  the  Spring  of  1987,  and  we  plan  to  teach  it  every 
other  spring  term  henceforth. 
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Computation  as  it  was  taught  by  Professors  Leighton  and  Leiserson  in  the  Spring  of  1987.  The 
notes were written by students in the class and were reviewed and organized by Bruce Maggs, Serge
Plotkin, and Joel Wein, who served as Teaching Assistants for the class. The notes are not meant
to be in polished form, and they probably contain several errors and omissions, particularly with
respect to references in the literature. Rather, they are intended to be an aid in teaching and/or
studying  some  of  the  advanced  work  in  networks,  parallel  computation,  and  VLSI  design.  For  a 
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The  class  was  attended  by  about  25  students,  15  of  whom  took  the  course  for  credit.  Students 
were  assumed  to  have  taken  an  introductory  course  on  this  subject  (e.g.  6.848/18.435  Theory  of 
Parallel  and  VLSI  Computation)  as  a  prerequisite.  The  lectures  were  supplemented  with  monthly 
problem  sets  and  two  programming  projects  on  a  16,000-processor  Connection  Machine  provided 
by  Thinking  Machines  Corporation  of  Cambridge,  Mass. 

It  is  our  intention  to  update  these  notes  each  year  that  the  course  is  taught.  Currently,  we  plan 
to  teach  the  course  every  other  spring.  We  would  appreciate  learning  of  any  errors  and  omissions 
in  the  current  notes  so  that  they  can  be  corrected  in  future  editions. 

Notes  for  the  introductory  course  on  this  material  ( 6.848/18.435  Theory  of  Parallel  and  VLSI 
Computation )  are  also  available  as  an  MIT-LCS  Research  Seminar  Series  RSS-1.  The  introductory 
course  was  last  taught  during  the  Fall  of  1987,  and  is  scheduled  to  be  taught  during  each  Fall  Term 
henceforth. 
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A  MIXED  FREQUENCY-TIME  APPROACH  FOR  FINDING  THE  STEADY-STATE 
SOLUTION  OF  CLOCKED  ANALOG  CIRCUITS 


K. Kundert, J. White, and A. Sangiovanni-Vincentelli


Abstract 

Performing  detailed  simulation  of  clocked  analog  circuits  (e.g.  switched-capacitor  filters 
and  switching  power  supplies)  with  circuit  simulation  programs  like  SPICE  is 
computationally  very  expensive.  In  this  paper  we  present  a  new,  more  efficient,  method 
for  computing  the  detailed  steady-state  solution  of  clocked  analog  circuits.  The  method 
exploits  the  property  of  such  circuits  that  the  waveforms  in  each  clock  cycle  are  similar 
but not exact duplicates of the preceding or following cycles. Therefore, by computing
accurately  a  few  selected  cycles,  the  entire  steady-state  solution  can  be  constructed 
efficiently. 
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FAST  ON-CHIP  DELAY  ESTIMATION  FOR  CELL-BASED  EMITTER  COUPLED 
LOGIC 


Peter  R.  O’Brien,  John  L.  Wyatt,  Jr.,  Thomas  L.  Savarino,  and  James  M.  Pierce 


Abstract 

The  goal  of  this  work  is  to  produce  fast,  but  accurate,  estimates  of  best  and  worst  case 
delay  for  on-chip  emitter  coupled  logic  (ECL)  nets.  The  work  consists  of  two  major 
parts:  1)  macromodelling  of  ECL  logic  gates  acting  as  both  sources  and  loads;  and  2) 
delay  estimation  for  individual  nets  using  the  gate  macromodel  parameters  and  RC  tree 
models  for  metal  interconnect.  Both  of  the  above  functions  (gate  macromodeling  and 
delay  estimation)  have  been  extensively  tested  on  an  industrial  ECL  process  and  cell 
(i.e.,  logic  gate)  library. 

The  success  of  a  macromodelling  approach  relies  on  repetitive  use  of  members  of  a 
library  of  modelled  cells.  A  “fixed”  computational  cost  (several  c.p.u.  hours  per  cell)  is 
paid to obtain simplified macromodel parameter values. Resultant timing estimates are
typically  within  5-10%  of  SPICE  and  are  obtained  roughly  three  orders  of  magnitude 
more  quickly  than  SPICE. 
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ABSTRACT 

The goal of this work is to produce fast, but accurate,
estimates of best and worst case delay for on-chip
emitter coupled logic (ECL) nets. The work consists
of two major parts: 1) macromodelling of ECL logic
gates acting as both sources and loads, and 2) delay
estimation for individual nets using the gate macromodel
parameters and RC tree models for metal interconnect.
Both of the above functions (gate macromodelling and
delay estimation) have been extensively tested on an
industrial ECL process and cell (i.e., logic gate) library.

The success of a macromodelling approach relies on
repetitive use of members of a library of modelled cells.
A "fixed" computational cost (several c.p.u. hours per
cell) is paid to obtain simplified macromodel parameter
values. Resultant timing estimates are typically within
5-10% of SPICE [1] and are obtained roughly three
orders of magnitude more quickly than SPICE.


I.  INTRODUCTION 
So, "metal delay" has two distinct components: extra
delay through the source gate caused by the loading
of the source gate by metal, and propagation delay
through the metal itself. Worst case (or "slow") metal
delay is simply the definition in (1) evaluated using slow
SPICE process parameters for the logic gates and metal
interconnect, the maximum expected input risetime at
point A, and a slow target voltage threshold at points
B and C. The slow target voltage thresholds for rising
and falling transitions are chosen to be, respectively:
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for  some  Dn,„e  >  0.  The  definition  of  best  case 

(or "fast") metal delay is made in a similar way with
fast  versions  of  SPICE  parameters,  input  risetime,  and 
output  voltage  threshold. 


Definition of terms


The definition of "metal delay" can best be illustrated
by a simplified interconnect net with no branching
and only one load gate (Fig. 1). Let T_AB(0) represent
delay through the unloaded source gate. Let
T_AB(L) represent delay through the source gate loaded
by an interconnect net of length L, as shown in Fig. 1.
For all gates in our cell library, T_AB(0) is known.

Figure 1: Simplified interconnect net.

What our algorithm estimates is "metal delay" defined by:


Gate  delay  vs.  interconnect  delay 


Previous work on waveform bounding and estimation
for RC tree networks [2,3,4], with application to
MOS circuits, has focused on the propagation component
(T_BC) of interconnect delay. Relatively simple
models are used for the logic gates themselves. More
recently, detailed macromodels for MOS logic gates [5]
have been developed and used in conjunction with the
RC tree delay estimation results. Good correlation with
SPICE is obtained by fitting macromodel parameters to
selected SPICE experiments. We develop macromodels
specifically for ECL gates at a level of detail similar to
[5]. This fills a definite need since most reported work in
this area has been for MOS circuits, even though bipolar
digital circuits are presently in wide use for high-performance
applications. Recent work that has been
reported for bipolar circuits [6,7,8] is concerned mainly
with logic simulation, and the timing models used are
relatively simple.


To emphasize the importance of accurately modelling
the source gate (as opposed to just the interconnect),
in Fig. 2 we plot separately the two components
of (rising transition, worst case) metal delay
versus total load net capacitance for a uniformly distributed
metal line with the simplified topology of Fig. 1.
The gate used as both the source and the load is a standard
2-input OR/NOR. We denote the total load net
capacitance by "C_NET" and the maximum expected
value of C_NET by "C_MAX". For low values of C_NET,
extra source gate delay dominates propagation delay;
the two become equal only when C_NET/C_MAX ≈ 0.51.
Furthermore, the statistical distribution of C_NET is not
uniform across [0, C_MAX], but rather is skewed towards
lower capacitance values. In fact, for our designs, 90%
of the load nets have C_NET/C_MAX values under 0.25,
where propagation delay is only 42% of extra source
gate delay. In addition, for a falling transition, extra
source gate delay is even more important than shown in
Fig. 2, typically exceeding propagation delay throughout
the entire range 0 < C_NET < C_MAX. Therefore,
extra source gate delay is typically the dominant component
of metal delay.

Figure 2: Two components of D_metal vs. total load net
capacitance.

II.  LOGIC  GATE  MACROMODELLING 
Load  modelling 

Modelling of ECL gates as loads is very simple. As
in MOS, a single linear lumped capacitor is an acceptable
load model. Modelling of leakage current in ECL
loads does not appear to be necessary. For our process,
the worst case (maximum expected metal resistance,
maximum expected fanout) voltage drop through metal
due to steady state leakage current is less than 3% of
the rail-to-rail voltage swing. In situations where leakage
current might be modelled (e.g., clock nets with
very high fanout), this could be done by adding a linear
resistor and a d.c. current source in parallel with the
capacitor [9,10]. The capacitance value is extracted from
SPICE simulations of transient voltage at the load and
current into the load. Since on-chip metal is modelled
as a linear RC tree, and load gates as simple linear
capacitors, this means an entire load net (metal and
loads) is modelled as a linear RC tree.

Source  modelling 

Modelling of ECL gates as sources is more difficult.
Because of an emitter-follower output stage, source gate
behavior exhibits a fundamental asymmetry between
rising and falling transitions. For a falling transition,
there is a limitation on transient sinking current as
C_NET increases. We use two different source model
types: the first one for falling transitions when C_NET ≥
C_THRESH (to model the transient sinking current
limitation), and the second one for falling transitions when
C_NET < C_THRESH and for all rising transitions. The
“threshold” capacitance (C_THRESH) is determined from
SPICE simulations of transient source gate output
current (I_OUT) during falling transitions. C_THRESH is
defined to be the value of C_NET where the sensitivity of
I_OUT to a perturbation in C_NET drops below a
predetermined value.

To model the sinking current limitation, the first
source gate model is simply a delayed current source
pulse. The model delay before the onset of the current
pulse and the magnitude of the pulse (I_SAT) are
extracted from the same SPICE simulations used to
determine C_THRESH. The duration of the pulse is exactly
long enough to sink the correct amount of charge:

$$\text{pulse duration} = \frac{C_{NET}\,(V_{high} - V_{low})}{I_{SAT}}. \qquad (4)$$
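
As an illustration of the charge balance behind eq. (4), the following minimal Python sketch computes the pulse duration as the charge to be removed from the net capacitance divided by the saturation sinking current. The function name and all numerical values are hypothetical, chosen only for the example; they are not taken from the process or cells described here.

```python
def pulse_duration(c_net, v_high, v_low, i_sat):
    """Pulse length needed to sink the charge C_NET * (V_high - V_low)
    at a constant current I_SAT (SI units: F, V, V, A -> s)."""
    if i_sat <= 0:
        raise ValueError("i_sat must be positive")
    charge = c_net * (v_high - v_low)  # charge that must be removed
    return charge / i_sat


# Hypothetical values, for illustration only.
t_pulse = pulse_duration(c_net=1.0e-12, v_high=-0.8, v_low=-1.6, i_sat=0.5e-3)
print(t_pulse)
```

The sketch makes explicit that, for a fixed logic swing and sinking current, the pulse duration grows linearly with the net capacitance.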

The second source gate model is based on the source
gate's d.c. drive curves, which show the static output
voltage vs. input voltage relationship for different
values of output load current. To describe a family of
three-segment piecewise-linear approximations to the
d.c. drive curves, four “d.c. parameters” are obtained
by curve fitting to d.c. SPICE output (see also [3,5]).
These d.c. parameters alone are sufficient to model the
source gate’s response to slow inputs, when the gate
behaves quasi-statically. However, an ECL input is
usually too fast for the source gate to respond
quasi-statically. The source gate responds somewhat more
slowly than the quasi-static model alone would predict.
So, four additional “dynamic parameters” are extracted
from SPICE data of transient source gate responses in
order to model, as a function of C_NET, the departure
from a purely quasi-static response.

Each of the two source gate models is used in
conjunction with an approximation to the driving-point
admittance of the load net given by a single lumped
RC segment (R_NET, C_NET). Based on values for R_NET,
C_NET, the source gate macromodel parameters, and the
input risetime, a closed-form analytical expression for
the model waveform (v_B(t)) is obtained. The detailed
model expressions are omitted here for brevity, but can
be found in [11].

Macromodel parameter summary

A total of 2(lev) load gate macromodel parameter
values (capacitance) are extracted for each cell, where
lev denotes the number of input levels of the cell being
modelled. (Note: the term “input level” refers to a
subset of the individual input signals that affect the
current-steering logic through the same number of
base-emitter junction voltage drops.) Capacitance values
are obtained for each combination of slow/fast SPICE
parameters and different input level.

A total of 76 source gate macromodel parameter
values are extracted for each cell: 12 for the first source
model (i.e., falling transition and C_NET ≥ C_THRESH),
and 64 for the second source model. For the first model:
3 parameters (C_THRESH, I_SAT, and current pulse delay)
for each combination of slow/fast SPICE parameters
and “true/complement” output side of the cell. For
the second model: 8 parameters (4 “d.c.” and 4
“dynamic”) for each combination of slow/fast SPICE
parameters, “true/complement” output side of the cell,
and rising/falling output transition.
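
The parameter totals quoted above follow from simple products over the characterization conditions. A minimal Python sketch of that bookkeeping is given below; the function names are hypothetical and serve only to make the arithmetic explicit.

```python
def load_param_count(lev):
    """One capacitance per (slow/fast SPICE corner, input level)."""
    corners = 2  # slow, fast
    return corners * lev


def source_param_count():
    """Return (first model, second model, total) parameter counts per cell."""
    corners = 2      # slow, fast SPICE parameters
    sides = 2        # true, complement output side
    directions = 2   # rising, falling output transition
    first = 3 * corners * sides                      # threshold cap., sat. current, pulse delay
    second = (4 + 4) * corners * sides * directions  # 4 d.c. + 4 dynamic parameters
    return first, second, first + second


print(source_param_count())
```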

III.  REDUCED-ORDER  INTERCONNECT 
MODELS 

Driving-point  admittance  approximation 


As mentioned in the previous section, to enable the
computation of an analytical expression for the model
waveform v_B(t), the driving-point admittance of the
load net is approximated by a single lumped RC segment
(R_NET, C_NET). The values of R_NET and C_NET are
chosen to match the first two terms of the Taylor series
expansion around s = 0 of the driving-point admittance
of the given load net:

$$Y_{load\,net}(s) = \sum_{n=1}^{\infty} y_n s^n, \qquad (5)$$

where the series representation is valid only within some
circle of convergence |s| < r_conv. Our approximate
driving-point admittance is:

$$Y_{APPROX}(s) = \frac{s\,C_{NET}}{1 + s\,R_{NET} C_{NET}} = \sum_{n=1}^{\infty} (-1)^{n-1} R_{NET}^{\,n-1} C_{NET}^{\,n} s^n, \qquad (6)$$

where the second equality is valid only within the circle
of convergence |s| < 1/(R_NET C_NET). So to match both
the s and s^2 terms, we choose:

$$C_{NET} = y_1, \qquad (7)$$

$$R_{NET} = -\frac{y_2}{y_1^{2}}. \qquad (8)$$

The approximation is computed quickly using an
algorithm [11] which allows the computation of y_1 and y_2
(and, hence, of R_NET and C_NET) to proceed sequentially
upstream from the leaf nodes of the load net until
the source gate output is finally reached. The
low-frequency nature of this approximation, implicit in the
use of a Taylor series expansion around s = 0, turns
out to be entirely justified. For our process, most of
the frequency content of a typical source gate output
waveform lies well inside the circle of convergence for
both the true and approximate driving-point
admittance [11].
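
One way such an upstream computation could proceed is sketched below in Python. This is not the algorithm of [11], whose details are not given here; it is a minimal illustration under the assumption that the load net is an RC tree in which every node has a series resistor from its parent and a capacitor to ground. It uses the fact that if a subtree has admittance Y(s) = y1 s + y2 s^2 + ..., then seen through a series resistor R the admittance is Y/(1 + RY) = y1 s + (y2 - R y1^2) s^2 + .... The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """RC tree node: series resistor from the parent, capacitor to ground."""
    r: float
    c: float
    children: List["Node"] = field(default_factory=list)


def admittance_terms(node):
    """Return (y1, y2) for the admittance seen looking into `node`
    through its series resistor: Y(s) = y1*s + y2*s**2 + O(s**3)."""
    y1, y2 = node.c, 0.0
    for child in node.children:
        cy1, cy2 = admittance_terms(child)  # parallel branches add
        y1 += cy1
        y2 += cy2
    # Series resistor: Y / (1 + r*Y) = y1*s + (y2 - r*y1**2)*s**2 + ...
    return y1, y2 - node.r * y1 ** 2


def lumped_segment(root):
    """Single RC segment matching the s and s**2 terms: C = y1, R = -y2 / y1**2."""
    y1, y2 = admittance_terms(root)
    return -y2 / y1 ** 2, y1  # (R_NET, C_NET)


# Hypothetical two-segment uniform line with unit elements.
r_net, c_net = lumped_segment(Node(1.0, 1.0, [Node(1.0, 1.0)]))
print(r_net, c_net)
```

For a net that is already a single RC segment, the sketch returns that segment's own R and C, which is a quick sanity check on the matching conditions.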


Voltage  transfer  function  approximation 


We propagate the model source gate output voltage
waveform (v_B(t)) downstream to the load(s) of
interest by convolving with an approximate unit voltage
impulse response first developed by Horowitz [3]. We
use an approximate impulse response because obtaining
the precise impulse response of a general RC tree is too
computationally expensive. In addition, closed-form
analytical expressions are available for the approximate
impulse response. This allows, via convolution with the
model source gate output voltage waveform,
computation of closed-form analytical expressions for the model
voltage waveform(s) at the load(s) (v_C(t)). The model
voltage waveform at each load is then numerically
inverted, at the appropriate voltage threshold, in order
to compute the metal delay to that load.
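
The numerical inversion step can be illustrated with a simple bracketing root finder. The Python sketch below uses bisection and a normalized single-pole step response as a stand-in waveform; both choices are assumptions made for the example, since the actual closed-form load waveform and the inversion method used by the authors are not given here.

```python
import math


def crossing_time(v, v_threshold, t_lo, t_hi, tol=1e-15, max_iter=200):
    """Find t in [t_lo, t_hi] where v(t) crosses v_threshold, by bisection.
    Assumes v is continuous and crosses the threshold once in the bracket."""
    f_lo = v(t_lo) - v_threshold
    f_hi = v(t_hi) - v_threshold
    if f_lo * f_hi > 0:
        raise ValueError("threshold is not bracketed")
    for _ in range(max_iter):
        t_mid = 0.5 * (t_lo + t_hi)
        f_mid = v(t_mid) - v_threshold
        if f_lo * f_mid <= 0:
            t_hi = t_mid                  # crossing lies in the lower half
        else:
            t_lo, f_lo = t_mid, f_mid     # crossing lies in the upper half
        if t_hi - t_lo < tol:
            break
    return 0.5 * (t_lo + t_hi)


# Hypothetical stand-in waveform: normalized single-pole step response.
tau = 1.0e-9


def waveform(t):
    return 1.0 - math.exp(-t / tau)


t50 = crossing_time(waveform, 0.5, 0.0, 10 * tau)
print(t50)
```

Because the model waveform is available in closed form, each evaluation inside the loop is cheap, so a robust bracketing method of this kind costs little compared with a transient circuit simulation.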

Let h(t) and h_approx(t) denote, respectively, the true
and approximate unit voltage impulse response at a
given load. Let H(s) and H_approx(s) denote,
respectively at the same load, the Laplace transform of the
true and approximate impulse response. The
approximate transfer function has two poles and one zero:

$$H_{approx}(s) = \frac{1 + s\tau_z}{(1 + s\tau_1)(1 + s\tau_2)}. \qquad (9)$$

The time constants τ_z, τ_1, and τ_2 are determined by the
following three constraints:

$$\int_0^{\infty} t\,h_{approx}(t)\,dt = \int_0^{\infty} t\,h(t)\,dt, \qquad (10)$$

$$\int_0^{\infty} t^2\,h_{approx}(t)\,dt = \int_0^{\infty} t^2\,h(t)\,dt, \qquad (11)$$

1  —  (r,  -t-  r2)  s  ~  a2s: 


H(s). 


(12) 
Figure 3: Rising transition comparison.


Figure  4:  Falling  transition  comparison. 


In  Figs.  3  and  4,  we  show  comparisons  of  SPICE  vs. 
our  algorithm.  In  Fig.  3,  we  use  the  same  SPICE  data 
shown  in  Fig.  2.  In  Fig.  4,  we  use  the  same  gate  (2- 
input  OR/NOR)  and  the  same  net  topology  (uniform 
unbranched  line  with  a  single  load),  but  we  examine  a 
falling  transition.  Two  points  to  note  about  Fig.  4  are: 

1. the boundary between the two different source
model types is C_NET = C_THRESH = 0.43 C_MAX
for this particular gate; and

2. the two Tec curves are nearly indistinguishable
on the time scale of the plot.

Assuming that gate macromodel parameters have
been obtained in advance, the computation speed-up
relative to SPICE is approximately three orders of
magnitude. Similar accuracy and speed-up results are
obtained using a wide variety of different logic cells and
non-uniform branched net topologies [11].
lateral inhibition networks by using a locally connected on-chip resistive grid. A serious
problem  of  unwanted  spontaneous  oscillation  often  arises  with  these  circuits  and 
renders  them  unusable  in  practice.  This  paper  reports  a  design  approach  that 
guarantees  such  a  system  will  be  stable,  even  though  the  values  of  designed  elements  in 
the  resistive  grid  may  be  imprecise  and  the  location  and  values  of  parasitic  elements  may 
be  unknown.  The  method  is  based  on  a  rigorous,  somewhat  novel  mathematical 
analysis  using  Tellegen’s  theorem  and  the  idea  of  Popov  multipliers  from  control  theory. 

It  is  thoroughly  practical  because  the  criteria  are  local  in  the  sense  that  no  overall 
analysis  of  the  interconnected  system  is  required,  empirical  in  the  sense  that  they 
involve  only  measurable  frequency  response  data  on  the  individual  cells,  and  robust  in 
the  sense  that  unmodelled  parasitic  resistances  and  capacitances  in  the  interconnect 
network  cannot  affect  the  analysis. 


Microsystems Research Center, Room 39-321
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139
Telephone (617) 253-8138

CIRCUIT DESIGN CRITERIA FOR STABLE LATERAL INHIBITION NEURAL NETWORKS


J. L. WYATT, Jr. and D. L. STANDLEY


Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139


ABSTRACT

In the analog VLSI implementation of neural
systems, it is sometimes convenient to build
lateral inhibition networks by using a locally
connected on-chip resistive grid. A serious
problem of unwanted spontaneous oscillation often arises
with these circuits and renders them unusable in
practice. This paper reports a design approach that
guarantees such a system will be stable, even though
the values of designed elements in the resistive
grid may be imprecise and the location and values of
parasitic elements may be unknown. The method is
based on a rigorous, somewhat novel mathematical
analysis using Tellegen's theorem and the idea of
Popov multipliers from control theory. It is
thoroughly practical because the criteria are local
in the sense that no overall analysis of the
interconnected system is required, empirical in the
sense that they involve only measurable frequency
response data on the individual cells, and robust
in the sense that unmodelled parasitic resistances
and capacitances in the interconnect network cannot
affect the analysis.

I. INTRODUCTION

The term "lateral inhibition" first arose in
neurophysiology to describe a common form of neural
circuitry in which the output of each neuron in
some population is used to inhibit the response of
each of its neighbors. Perhaps the best understood
example is the horizontal cell layer in the
vertebrate retina, in which lateral inhibition
simultaneously enhances intensity edges and acts as an
automatic gain control to extend the dynamic range
of the retina as a whole [1]. The principle has
been used in the design of artificial neural system
algorithms by Kohonen [2] and others and in the
electronic design of neural chips by Carver Mead et
al. [3,4].

In the VLSI implementation of neural systems,
it is convenient to build lateral inhibition
networks by using a locally connected on-chip resistive
grid. Linear resistors fabricated in, e.g.,
polysilicon, yield a very compact realization, and
nonlinear resistive grids, made from MOS transistors,
have been found useful for image segmentation [4,5].
Networks of this type can be divided into two
classes: feedback systems and feedforward-only
systems. In the feedforward case one set of
amplifiers imposes signal voltages or currents on
the grid and another set reads out the resulting
response for subsequent processing, while the same
amplifiers both "write" to the grid and "read" from
it in a feedback arrangement. Feedforward networks
of this type are inherently stable, but feedback
networks need not be.

A practical example is one of Carver Mead's
retina chips [3] that achieves edge enhancement by
means of lateral inhibition through a resistive
grid. Figure 1 shows a single cell in a
continuous-time version of this chip. Note that the capacitor
voltage is affected both by the local light intensity
incident on that cell and by the capacitor voltages
on neighboring cells of identical design. Any cell
drives its neighbors, which drive both their
distant neighbors and the original cell in turn. Thus
the necessary ingredients for instability (active
elements and signal feedback) are both present in
this system, and in fact the continuous-time version
oscillates so badly that the original design is
scarcely usable in practice with the lateral
inhibition paths enabled [6]. Such oscillations can
readily occur in any resistive grid circuit with
active elements and feedback, even when each
individual cell is quite stable. Analysis of the
conditions of instability by straightforward methods
appears hopeless, since the number of simultaneously
active feedback loops is enormous.

Figure 1. This photoreceptor and signal processor
circuit, using two MOS transconductance amplifiers,
realizes lateral inhibition by communicating with
similar units through a resistive grid.

This paper reports a practical design approach
that rigorously guarantees such a system will be
stable. The very simplest version of the idea is
intuitively obvious: design each individual cell so
that, although internally active, it acts like a
passive system as seen from the resistive grid. In
circuit theory language, the design goal here is
that each cell's output impedance should be a
positive-real function. This is sometimes not
too difficult in practice; we will show that the
original network in Fig. 1 satisfies this condition
in the absence of certain parasitic elements. More
important, perhaps, it is a condition one can verify
experimentally by frequency-response measurements.

It is physically apparent that a collection
of cells that appear passive at their terminals will
form a stable system when interconnected through
a passive medium such as a resistive grid. The
research contributions, reported here in summary
form, are i) a demonstration that this passivity or
positive-real condition is much stronger than we
actually need and that weaker conditions, more easily
achieved in practice, suffice to guarantee stability
of the linear network model, and ii) an extension to
the nonlinear domain that furthermore rules out
large-scale oscillations under certain conditions.


II. FIRST-ORDER LINEAR ANALYSIS OF A SINGLE CELL

We begin with a linear analysis of an
elementary model for the circuit in Fig. 1. For an initial
approximation to the output admittance of the cell
we simplify the topology (without loss of relevant
information) and use a naive model for the
transconductance amplifiers, as shown in Fig. 2.


Figure 2. Simplified network topology and
transconductance amplifier model for the circuit in Fig.
1. The capacitor in Fig. 1 has been absorbed into
C_o2.


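Straightforward calculations show that the
output admittance is

$$Y(s) = g_{m2} + R_{o2}^{-1} + s\,C_{o2} + \frac{g_{m1}\,g_{m2}\,R_{o1}}{1 + s\,R_{o1} C_{o1}}. \qquad (1)$$

This is a positive-real, i.e., passive, admittance
that could always be realized by a network of the
form shown in Fig. 3, where

$$R_1 = \left(g_{m2} + R_{o2}^{-1}\right)^{-1}, \quad R_2 = \left(g_{m1}\,g_{m2}\,R_{o1}\right)^{-1}, \quad \text{and} \quad L = \frac{C_{o1}}{g_{m1}\,g_{m2}}.$$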

Although the original circuit contains no
inductors, the realization has both capacitors and
inductors and thus is capable of damped oscillations.
Nonetheless, if the transamp model in Fig. 2 were
perfectly accurate, no network created by
interconnecting such cells through a resistive grid (with
parasitic capacitances) could exhibit sustained
oscillations since all the elements are passive.
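
The equivalence between the cell's output admittance and a passive realization can be checked numerically. The Python sketch below assumes that eq. (1) reads Y(s) = g_m2 + 1/R_o2 + s C_o2 + g_m1 g_m2 R_o1 / (1 + s R_o1 C_o1), and that the realization consists of C_o2 in parallel with a resistor R1 and with a series R2-L branch. The element values are hypothetical and are not the "typical" values referred to in the text.

```python
import math


def cell_admittance(s, gm1, gm2, ro1, ro2, co1, co2):
    """Output admittance of the simplified cell model (assumed form of eq. (1))."""
    return gm2 + 1.0 / ro2 + s * co2 + gm1 * gm2 * ro1 / (1.0 + s * ro1 * co1)


def realization_admittance(s, gm1, gm2, ro1, ro2, co1, co2):
    """Same admittance as a passive network: C_o2 || R1 || (R2 in series with L)."""
    r1 = 1.0 / (gm2 + 1.0 / ro2)
    r2 = 1.0 / (gm1 * gm2 * ro1)
    ind = co1 / (gm1 * gm2)
    return s * co2 + 1.0 / r1 + 1.0 / (r2 + s * ind)


# Hypothetical element values, for illustration only.
params = dict(gm1=1e-6, gm2=1e-6, ro1=1e8, ro2=1e8, co1=1e-12, co2=1e-12)
for freq_hz in (1.0, 1e3, 1e6):
    s = 2j * math.pi * freq_hz
    y_cell = cell_admittance(s, **params)
    y_rlc = realization_admittance(s, **params)
    print(freq_hz, y_cell, abs(y_cell - y_rlc))
```

On the imaginary axis the real part of this admittance is a sum of positive terms, which is the numerical counterpart of the passivity statement above.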

For element values that may be typical in practice,
the model in Fig. 3 has a lightly damped resonance
around 1 kHz with a Q ≈ 10. This disturbingly high
Q suggests that the cell will be highly sensitive
to parasitic elements not captured by the simple
models in Fig. 2. Our preliminary analysis of a
much more complex model extracted from a physical
circuit layout created in Carver Mead's laboratory
indicated that the output impedance will not be
passive for all values of the transamp bias currents.
But a definite explanation of the instability awaits
a more careful circuit modelling effort and perhaps
the design of an on-chip impedance measuring
instrument.

Figure 3. Passive network realization of the
input admittance (eq. (1)) of the circuit in Fig. 2.

III.  STABILITY  OF  A  LINEAR  MODEL  FOR  THE  NETWORK 

Transistor parasitics and layout parasitics
will cause the output admittance of the individual
active cells to deviate from the form given in eq.
(1) and Fig. 3, and any very accurate model will
necessarily be quite high order. The following
theorem shows what sort of deviations we can allow
and still guarantee that the network is stable.

Terminology 

The term closed right half plane refers to the
set of complex numbers s = σ + jω with σ ≥ 0 and the
term closed third quadrant refers to the set of
complex numbers with σ ≤ 0 and ω ≤ 0. A natural
frequency is a complex frequency s_0 such that,
when all branch impedances and admittances are
evaluated at s_0, there exists a nonzero solution
for the complex branch voltages {V_k} and currents
{I_k}.

Theorem  1 

Consider a linear network of arbitrary topology,
consisting of any number of positive 2-terminal
resistors and capacitors and of N lumped linear
admittances Y_n(s), n = 1, 2, ..., N, having no poles
or zeroes in the closed right half plane. Then
the network is stable, in the sense that it has no
natural frequency in the closed right half plane
except perhaps at the origin, if at each frequency
ω > 0 there exists a phase angle θ(ω) such that
0° ≤ θ(ω) ≤ 90° and |∠Y_n(jω) − θ(ω)| < 90°, n = 1, 2, ..., N.

An equivalent statement of this last condition
is that the Nyquist plot of each cell's output
admittance for ω ≥ 0 never intersects the closed 3rd
quadrant, and that no two cells' output admittance
phase angles can ever differ by as much as 180°.

If all the active cells are designed identically
and fabricated on the same chip, their phase angles
should track closely in practice and thus this
second condition is a natural one.
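
Because the condition of Theorem 1 involves only measured frequency-response data, it can be tested one frequency at a time. The Python sketch below checks, for a set of admittance samples taken at a single frequency, whether a common phase angle exists. It assumes the admissible window for the common angle is 0° to 90°, which is what the equivalent Nyquist statement implies; the function name and the sample values are hypothetical.

```python
import cmath
import math


def common_phase_exists(admittances):
    """True if some theta in [0, 90] degrees satisfies
    |angle(Y_n) - theta| < 90 degrees for every sample Y_n."""
    if any(y == 0 for y in admittances):
        return False  # phase undefined at a zero of the admittance
    phases = [math.degrees(cmath.phase(y)) for y in admittances]
    low = max(phases) - 90.0   # theta must be strictly above this
    high = min(phases) + 90.0  # theta must be strictly below this
    # The open interval (low, high) must meet the closed interval [0, 90].
    return low < high and low < 90.0 and high > 0.0


# Hypothetical samples of cell admittances at one frequency.
print(common_phase_exists([1 + 1j, 1 - 0.5j]))     # phases within the allowed range
print(common_phase_exists([-1 - 1j]))              # sample in the third quadrant
print(common_phase_exists([-1 + 0.1j, 1 - 0.5j]))  # phases more than 180 degrees apart
```

Repeating the check over a sweep of measurement frequencies gives a direct experimental test of the design criterion for a given set of cells.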

Note that the above statement of the theorem
does not rule out the possibility of an unusual
instability arising from a repeated natural
frequency at the origin. But a more careful argument,
omitted here, shows that the only possible nonzero
network solution at s = 0 is the stable one in which
capacitors in capacitor-only loops have nonzero d.c.
voltages and all other branch voltages and currents
vanish.


Proof  of  Theorem  1 


Let s_0 denote a natural frequency of the
network and {V_k} denote the complex branch voltages
at a corresponding solution. By Tellegen's theorem
[8], or conservation of complex power, we have

$$\sum_{\text{resistances}} |V_k|^2 R_k^{-1} + \sum_{\text{capacitances}} |V_k|^2 s_0 C_k + \sum_{\text{cell output branches}} |V_k|^2\,Y_k(s_0) = 0. \qquad (2)$$

Solutions of the form s_0 = jω ≠ 0 can be ruled out
as follows. Note that for each ω > 0 all the cell
admittance values Y_n(jω) lie strictly above and to
the right of a straight line through the origin of
the complex plane making an angle of θ(ω) − 90° with
the real positive axis. The capacitance admittances
(jωC_k) and the resistor admittances (R_k^{-1}) also lie
above and to the right of this line. Thus no
positive linear combination of these admittances
can vanish as required by eq. (2).

To rule out solutions in the open right half
plane, it is shown by a homotopy argument that the
existence of such a solution implies the existence
of a network satisfying the conditions of Thm. 1 and
having natural frequencies of the form s_0 = jω ≠ 0
(already shown not to exist). Add a parallel
conductance G to each element of the network, and
call the parallel element pair a "composite
element." Consider the locus of the natural
frequencies as G is increased from zero to arbitrarily
high values. Eventually they must all enter the
open left half plane because all the composite
elements become strictly passive at sufficiently high G
values. Since the network started out with at least
one open right half plane natural frequency, and the
natural frequencies depend continuously on G, then
there exists a G > 0 such that the network has natural
frequencies of the form s_0 = jω ≠ 0 (s_0 = 0 is ruled
out by the strict passivity of all the composite
elements here). It is easily verified that the
collection of composite network elements satisfies
the Thm. 1 conditions. Thus, open right half plane
natural frequencies are ruled out.


IV.  STABILITY  RESULT  FOR  NETWORKS  WITH  NONLINEAR 
RESISTORS  AND  CAPACITORS 

The previous result for linear networks can
afford some limited insight into the behavior of
nonlinear networks. First the nonlinear equations
are linearized about an equilibrium point and
Theorem 1 is applied to the linear model. If the
linearized model is stable, then the equilibrium
point of the original nonlinear network is locally
stable, i.e., the network will return to that
equilibrium point if the initial condition is
sufficiently near it. But the result in this
section, in contrast, applies to the full nonlinear
circuit model and allows one to conclude that in
certain circumstances the network cannot oscillate
even if the initial state is arbitrarily far from
the equilibrium point.

Terminology


We say that a function y = f(x) lies in the
sector [a,b] if a x² ≤ x f(x) ≤ b x². And we say
that an impedance Z(s) satisfies the Popov criterion
if (1 + τs)Z(s) is positive real for some
τ > 0. Note that this statement of the Popov criterion
differs slightly from that given in standard
references [9,10].
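
A sampled version of this criterion is easy to evaluate from frequency-response data. The Python sketch below checks that the real part of (1 + jωτ) Z(jω) is nonnegative over a grid of frequencies. This is only a necessary check, since positive realness also requires that the function be analytic in the open right half plane, which sampling on the imaginary axis cannot establish. The impedance used is a hypothetical example and is not a model of the cell discussed in this paper.

```python
def popov_real_part(z, tau, omega):
    """Real part of (1 + j*omega*tau) * Z(j*omega)."""
    s = 1j * omega
    return ((1.0 + tau * s) * z(s)).real


def passes_sampled_popov_test(z, tau, omegas):
    """Necessary check only: nonnegative real part at every sampled frequency."""
    return all(popov_real_part(z, tau, w) >= 0.0 for w in omegas)


# Hypothetical impedance: not positive real by itself, but
# (1 + s) * Z(s) = 1 / (s + 2) is positive real.
def z_example(s):
    return 1.0 / ((s + 1.0) * (s + 2.0))


omegas = [10.0 ** (k / 4.0) for k in range(-8, 17)]  # 0.01 ... 1e4 rad/s
print(passes_sampled_popov_test(z_example, 1.0, omegas))
print(passes_sampled_popov_test(z_example, 0.0, omegas))
```

The example shows how a multiplier with τ > 0 can admit an impedance whose own real part goes negative at high frequency.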


Theorem  2 


Consider a network consisting of possibly
nonlinear resistors and capacitors and cells with
linear output impedances Z_n(s), n = 1, 2, ..., N. Suppose

i) the resistor curves are continuous functions
i_k = g_k(v_k) where g_k lies in the sector [0, G_max],
G_max > 0, for all resistors,

ii) the capacitors are characterized by
i_k = C_k(v_k) dv_k/dt where 0 ≤ C_k(v_k) ≤ C_max for all k and
v_k, and

iii) the impedances Z_n(s) all satisfy the Popov
criterion for some common value of τ > 0. Then
the network is stable in the sense that, for any
initial condition,

$$\int_0^{\infty} \sum_{\substack{\text{all resistors}\\ \text{and capacitors}}} i_k^2(t)\,dt < \infty. \qquad (3)$$

Outline of Proof


By Tellegen’s theorem, for any set of initial
conditions and any time T > 0,

$$\int_0^T \sum_{\text{resistors}} \left(v_k(t) + \tau\,\dot v_k(t)\right) i_k(t)\,dt + \int_0^T \sum_{\text{capacitors}} \left(v_k(t) + \tau\,\dot v_k(t)\right) i_k(t)\,dt + \int_0^T \sum_{\text{cell impedances}} \left(v_k(t) + \tau\,\dot v_k(t)\right) i_k(t)\,dt = 0. \qquad (4)$$

For resistors, multiplying the sector inequality
v g(v) ≤ G_max v² by g(v)/v ≥ 0 yields i² = i g(v) ≤ G_max i v,
and hence

$$G_{max}^{-1} \int_0^T i_k^2(t)\,dt \le \int_0^T i_k(t)\,v_k(t)\,dt = \int_0^T i_k(t)\left[v_k(t) + \tau\,\dot v_k(t)\right] dt - \tau\left[\phi_k(v_k(T)) - \phi_k(v_k(0))\right], \qquad (5)$$

where

$$\phi_k(v) = \int_0^{v} g_k(v')\,dv' \ge 0 \qquad (6)$$

is the resistor co-content. Using the inequality
(6) in (5) yields

$$G_{max}^{-1} \int_0^T i_k^2(t)\,dt - \tau\,\phi_k(v_k(0)) \le \int_0^T i_k(t)\left[v_k(t) + \tau\,\dot v_k(t)\right] dt. \qquad (7)$$

For capacitors, integrating the inequality
i_k² = C_k²(v_k) (dv_k/dt)² ≤ C_max C_k(v_k) (dv_k/dt)² yields

$$\frac{\tau}{C_{max}} \int_0^T i_k^2(t)\,dt \le \tau \int_0^T C_k(v_k)\,\dot v_k^2(t)\,dt = \int_0^T i_k(t)\left[v_k(t) + \tau\,\dot v_k(t)\right] dt - \left[E_k(q_k(T)) - E_k(q_k(0))\right], \qquad (8)$$

where

$$E_k(q) = \int_0^{q} v_k(q')\,dq' \ge 0 \qquad (9)$$

is the capacitor energy. Using the inequality (9)
in (8) yields

$$\frac{\tau}{C_{max}} \int_0^T i_k^2(t)\,dt - E_k(q_k(0)) \le \int_0^T i_k(t)\left[v_k(t) + \tau\,\dot v_k(t)\right] dt. \qquad (10)$$

And for the cells, the assumption that (1 + τs)Z_n(s)
is positive real implies that

$$\int_0^T i_n(t)\left[v_n(t) + \tau\,\dot v_n(t)\right] dt \ge -E_n(0), \qquad (11)$$

where E_n(0) is the "initial energy in the cell's
output impedance" at t = 0, a function of the initial
conditions only. Substituting (7), (10) and (11)
into (4) yields

$$G_{max}^{-1} \sum_{\text{resistors}} \int_0^T i_k^2(t)\,dt + \frac{\tau}{C_{max}} \sum_{\text{capacitors}} \int_0^T i_k^2(t)\,dt \le \tau \sum_{\text{resistors}} \phi_k(v_k(0)) + \sum_{\text{capacitors}} E_k(q_k(0)) + \sum_{\text{cells}} E_n(0), \qquad (12)$$

where the right hand side is a function only of the
initial conditions. Thus (3) holds.
V. CONCLUDING REMARKS


The  design  criteria  presented  here  are  simple 
and  practical,  though  at  present  their  validity  is 
restricted  to  linear  models  of  the  cells.  There 
are  several  areas  of  further  work  to  be  pursued, 
one  of  which  is  an  analysis  of  the  differentiator 
cell  that  includes  amplifier  clipping  effects. 

Others  include  the  synthesis  of  a  compensator  for 
the  differentiator  cell,  an  extension  of  the  non¬ 
linear  result  to  include  impedance  multipliers  other 
than  the  Popov  operator,  and  a  waveform  bounding 
analysis  of  the  network  which  would  guarantee 
adequate  convergence  after  an  allotted  settling 
time. 


MASSACHUSETTS  INSTITUTE  OF  TECHNOLOGY 


VLSI  PUBLICATIONS 


VLSI  Memo  No.  88-440 
March  1988 


REDUCING  THE  PARALLEL  SOLUTION  TIME  OF  SPARSE  CIRCUIT  MATRICES 
USING  REORDERED  GAUSSIAN  ELIMINATION  AND  RELAXATION 


David  Smart  and  Jacob  White 


Abstract 

Using  parallel  processors  to  reduce  the  execution  times  of  classical  circuit  simulation 
programs  like  SPICE  and  ASTAP  has  been  the  focus  of  much  current  research.  In  these 
efforts,  good  parallel  speed  increases  have  been  achieved  for  linearized  system 
construction,  but  it  has  been  difficult  to  get  good  parallel  speed  increases  for  sparse 
matrix  solution.  In  this  paper  we  examine  two  approaches  for  reducing  parallel  sparse 
matrix  solution  time;  the  first  based  on  pivot  ordering  algorithms  for  Gaussian 
elimination,  and  the  second  based  on  relaxation  algorithms.  In  the  section  on  Gaussian 
elimination  sparse  matrix  solution,  we  present  a  pivot  ordering  algorithm  which  increases 
the  parallelism  of  Gaussian  elimination  compared  to  the  commonly  used  Markowitz 
method.  The  performance  of  the  new  algorithm  is  compared  to  other  suggested 
ordering  algorithms  for  a  collection  of  circuit  examples.  The  minimum  number  of  parallel 
steps  for  the  solution  of  a  tridiagonal  matrix  is  derived,  and  it  is  shown  that  this  optimum 
is  nearly  achieved  by  the  ordering  heuristics  which  attempt  to  maximize  parallelism.  In 
the  section  on  relaxation,  we  present  an  optimality  result  about  Gauss-Jacobi  over 
Gauss-Seidel  relaxation  on  parallel  processors. 
WAVEFORM  RELAXATION  APPLIED  TO  TRANSIENT  DEVICE  SIMULATION 


M.  Reichelt,  J.  White,  J.  Allen,  and  F.  Odeh 


Abstract 

In  this  paper  we  investigate  the  possibility  of  accelerating  the  transient  simulation  of  MOS 
devices  by  using  waveform  relaxation.  Standard  spatial  discretization  techniques  are 
used  to  generate  a  large,  sparsely-connected  system  of  algebraic  and  ordinary 
differential  equations  in  time.  The  waveform  relaxation  (WR)  algorithm  for  solving  such  a 
system  is  described,  and  several  theoretical  results  that  characterize  the  convergence  of 
WR  for  device  simulation  are  given.  In  addition,  one-dimensional  experimental  results 
are  presented. 
OPTIMAL  SIMULATIONS  BY  BUTTERFLY  NETWORKS:  EXTENDED  ABSTRACT 
Rosenberg 


Abstract 

We  investigate  the  power  of  the  Butterfly  network  (which  is  the  FFT  network  with  inputs 
and  outputs  identified)  relative  to  other  proposed  multicomputer  interconnection 
networks,  by  considering  how  efficiently  the  Butterfly  can  simulate  the  other  networks: 
Formally  we  ask.  How  efficiently  can  one  embed  the  graph  underlying  the  other  network  in 
the  graph  underlying  the  Butterfly?  We  measure  the  efficiency  of  an  embedding  of  a  graph 
G  in  a  graph  H  in  terms  of:  the  dilation,  or,  the  maximum  amount  that  any  edge  of  G  is 
"stretched"  by  the  embedding;  the  expansion,  or,  the  ratio  of  the  number  of  vertices  of  H  to 
the  number  of  vertices  of  G.  We  present  three  simulations  that  are  optimal,  to  within 
constant  factors:  (1)  Any  complete  binary  tree  can  be  embedded  in  a  Butterfly  graph,  with 
simultaneous dilation O(1) and expansion O(1). (2) The n-vertex X-tree can be embedded
in a Butterfly graph with simultaneous dilation O(log log n) and expansion O(1); no
embedding has better dilation, independent of expansion. (3) Any embedding of the n × n
mesh in the Butterfly graph must have dilation Ω(log n), independent of expansion; any
embedding  of  the  mesh  in  the  Butterfly  graph  achieves  this  dilation.  Thus,  we  have 
simulations  of  complete-binary-tree  machines,  X-tree  machines,  and  mesh  computers  on 
Butterfly  machines,  that  are  optimal  in  resource  utilization  (expansion)  and  delay 
(dilation),  to  within  constant  factors. 
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Abstract 

VLSI  communication  networks  are  wire  limited.  The  cost  of  a  network  is  not  a  function  of 
the  number  of  switches  required,  but  rather  a  function  of  the  wiring  density  required  to 
construct  the  network.  This  paper  analyzes  communication  networks  of  varying  dimension 
under  the  assumption  of  constant  wire  bisection.  Expressions  for  the  latency,  average  case 
throughput, and hot-spot throughput of k-ary n-cube networks with constant bisection are
derived that agree closely with experimental measurements. It is shown that low-dimensional
networks (e.g., tori) have lower latency and higher hot-spot throughput than
high-dimensional  networks  (e.g.,  binary  n-cubes)  with  the  same  bisection  width. 
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1  Introduction 

The critical component of a concurrent computer is its communication network. Many
algorithms are communication rather than processing limited. Fine-grain concurrent programs
execute as few as 10 instructions in response to a message [5]. To execute such programs
efficiently, the communication network must have a latency no greater than about 10 instruction
times, and a throughput sufficient to permit a large fraction of the nodes to transmit
simultaneously. Low-latency communication is also critical to support code sharing and garbage
collection across nodes.

* The research described in this paper was supported in part by the Defense Advanced Research Projects
Agency under contracts N00014-80-C-0622 and N00014-85-K-0124 and in part by a National Science Foundation
Presidential Young Investigator Award with matching funds from General Electric Corporation.

1 A preliminary version of this paper appeared in the proceedings of the 1987 Stanford Conference on Advanced
Research in VLSI [8].
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As  the  grain  size  of  concurrent  computers  continues  to  decrease,  communication  latency  becomes 
a  more  important  factor.  The  diameter  of  the  machine  grows,  messages  are  sent  more  frequently, 
and  fewer  instructions  are  executed  in  response  to  each  message.  Low  latency  is  more  difficult 
to achieve in a fine-grain machine because the available wiring space grows more slowly than
the expected traffic. Since the machine must be constructed in three dimensions, the bisection
area grows only as $N^{2/3}$ while traffic grows at least as fast as $N$, the number of nodes.

VLSI  systems  are  wire  limited.  The  cost  of  these  systems  is  predominantly  that  of  connecting 
devices, and the performance is limited by the delay of these interconnections. Thus, to achieve
the  required  performance,  the  network  must  make  efficient  use  of  the  available  wire.  The 
topology  of  the  network  must  map  into  the  three  physical  dimensions  so  that  messages  are 
not  required  to  double  back  on  themselves,  and  in  a  way  that  allows  messages  to  use  all  of  the 
available  bandwidth  along  their  path. 

This paper considers the problem of constructing wire-efficient communication networks,
networks that give the optimum performance for a given wire density. We compare networks
holding wire bisection, the number of wires crossing a cut that evenly divides the machine,
constant. Thus we compare low-dimensional networks with wide communication channels against
high-dimensional networks with narrow channels. We investigate the class of k-ary n-cube
interconnection networks and show that low-dimensional networks outperform high-dimensional
networks with the same bisection width.

The  remainder  of  this  paper  describes  the  design  of  wire-efficient  communication  networks. 
Section 2 describes the assumptions on which this paper is based. The family of k-ary n-cube
networks is described in Section 2.1. We restrict our attention to k-ary n-cubes because it is the
dimension of the network that is important, not the details of its topology. Section 2.2 introduces
wormhole routing [18], a low-latency routing technique. Network cost is determined primarily
by wire density, which we will measure in terms of bisection width. Section 2.3 introduces the
idea of bisection width, and discusses delay models for network channels. A performance model
of  these  networks  is  derived  in  Section  3.  Expressions  are  given  for  network  latency  as  a  function 
of  traffic  that  agree  closely  with  experimental  results.  Under  the  assumption  of  constant  wire 
density,  it  is  shown  that  low-dimensional  networks  achieve  lower  latency  and  better  hot-spot 
throughput  than  do  high-dimensional  networks. 


2  Preliminaries 

2.1  k-ary n-cubes

Many  different  network  topologies  have  been  proposed  for  use  in  concurrent  computers:  trees 
[4] [13] [19], Benes networks [3], Batcher sorting networks [1], shuffle exchange networks [21],
Omega networks [12], indirect binary n-cube or flip networks [2] [20], and direct binary n-cubes
[17], [15], [22]. The binary n-cube is a special case of the family of k-ary n-cubes, cubes with n
dimensions and k nodes in each dimension.

Figure 1: A Binary 6-Cube Embedded in the Plane

Most concurrent computers have been built using networks that are either k-ary n-cubes or
are isomorphic to k-ary n-cubes: rings, meshes, tori, direct and indirect binary n-cubes, and
Omega networks. Thus, in this paper we restrict our attention to k-ary n-cube networks. We
refer to n as the dimension of the cube and k as the radix. Dimension, radix, and number of
nodes are related by the equation

$$N = k^n, \quad \left(k = \sqrt[n]{N},\; n = \log_k N\right). \tag{1}$$

It  is  the  dimension  of  the  network  that  is  important,  not  the  details  of  its  topology. 

A node in a k-ary n-cube can be identified by an n-digit radix-k address, $a_0, \ldots, a_{n-1}$. The
$i$th digit of the address, $a_i$, represents the node's position in the $i$th dimension. Each node
can forward messages to its upper neighbor in each dimension, $i$, with address
$a_0, \ldots, a_i + 1 \pmod{k}, \ldots, a_{n-1}$.
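
The addressing scheme can be illustrated with a short sketch. The function names `to_address` and `upper_neighbor` are illustrative assumptions rather than names from the paper, and digits are stored least significant first so that list index i holds $a_i$.

```python
def to_address(node, k, n):
    """Convert a node index into its n-digit radix-k address [a_0, ..., a_{n-1}]."""
    digits = []
    for _ in range(n):
        digits.append(node % k)  # a_i is the node's position in dimension i
        node //= k
    return digits


def upper_neighbor(address, i, k):
    """Address reached over the unidirectional channel in dimension i."""
    neighbor = list(address)
    neighbor[i] = (neighbor[i] + 1) % k  # wrap around at the end of the ring
    return neighbor


# Example: node 5 in a 3-ary 4-cube (81 nodes).
addr = to_address(5, k=3, n=4)
print(addr, upper_neighbor(addr, 0, k=3))
```

Because every hop increments one digit modulo k, a node whose digit is already k - 1 wraps around to 0, which is the end-around connection of the ring in that dimension.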

In this paper we assume that our k-ary n-cubes are unidirectional for simplicity. We will see
that our results do not change appreciably for bidirectional networks. For an actual machine,
however, there are many compelling reasons to make our networks bidirectional. Most
importantly, bidirectional networks allow us to exploit locality of communication. If an object, A,
sends  a  message  to  an  object,  B,  there  is  a  high  probability  of  B  sending  a  message  back  to  A. 
In  a  bidirectional  network,  a  round  trip  from  A  to  B  can  be  made  short  by  placing  A  and  B 
close  together.  In  a  unidirectional  network,  a  round  trip  will  always  involve  completely  circling 
the  machine  in  at  least  one  dimension. 

Figures 1-3 show three k-ary n-cube networks in order of decreasing dimension. Figure 1
shows a binary 6-cube (64 nodes). A 3-ary 4-cube (81 nodes) is shown in Figure 2. An 8-ary
2-cube (64 nodes), or torus, is shown in Figure 3. Each line in Figure 1 represents two
communication  channels,  one  in  each  direction,  while  each  line  in  Figures  2  and  3  represents  a 
single  communication  channel. 


Figure  2:  A  Ternary  4-Cube  Embedded  in  the  Plane 


Figure 3: An 8-ary 2-Cube (Torus)

Figure 4: Latency of store-and-forward routing (top) vs. wormhole routing (bottom).

2.2  Wormhole  Routing 

In this paper we consider networks that use wormhole [18] rather than store-and-forward [23]
routing.  Instead  of  storing  a  packet  completely  in  a  node  and  then  transmitting  it  to  the  next 
node,  wormhole  routing  operates  by  advancing  the  head  of  a  packet  directly  from  incoming  to 
outgoing  channels.  Only  a  few  flow  control  digits  (flits)  are  buffered  at  each  node.  A  flit  is  the 
smallest  unit  of  information  that  a  queue  or  channel  can  accept  or  refuse. 

As soon as a node examines the header flit(s) of a message, it selects the next channel on
the  route  and  begins  forwarding  flits  down  that  channel.  As  flits  are  forwarded,  the  message 
becomes  spread  out  across  the  channels  between  the  source  and  destination.  It  is  possible  for 
the  first  flit  of  a  message  to  arrive  at  the  destination  node  before  the  last  flit  of  the  message 
has  left  the  source.  Because  most  flits  contain  no  routing  information,  the  flits  in  a  message 
must  remain  in  contiguous  channels  of  the  network  and  cannot  be  interleaved  with  the  flits  of 
other  messages.  When  the  header  flit  of  a  message  is  blocked,  all  of  the  flits  of  a  message  stop 
advancing  and  block  the  progress  of  any  other  message  requiring  the  channels  they  occupy. 

A  method  similar  to  wormhole  routing,  called  virtual  cut-through,  is  described  in  [11].  Virtual 
cut-through  differs  from  wormhole  routing  in  that  it  buffers  messages  when  they  block,  removing 
them  from  the  network.  With  wormhole  routing,  blocked  messages  remain  in  the  network. 
Figure 4 illustrates the advantage of wormhole routing. There are two components of latency,
distance and message aspect ratio. The distance, $D$, is the number of hops required to get from
the source to the destination. The message aspect ratio (message length, $L$, normalized to the
channel width, $W$) is the number of channel cycles required to transmit the message across one
channel. The top half of the figure shows store-and-forward routing. The message is entirely
transmitted from node $N_0$ to node $N_1$, then from $N_1$ to $N_2$, and so on. With store-and-forward
routing, latency is the product of $D$ and $L/W$,

$$T_{SF} = T_c \left(D \times \frac{L}{W}\right). \tag{2}$$

The bottom half of Figure 4 shows wormhole routing. As soon as a flit arrives at a node, it is
forwarded to the next node. With wormhole routing, latency is reduced to the sum of $D$ and $L/W$,

$$T_{WH} = T_c \left(D + \frac{L}{W}\right). \tag{3}$$

In  both  of  these  equations,  Tc  is  the  channel  cycle  time,  the  amount  of  time  required  to  perform 
a  transaction  on  a  channel. 
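
The difference between the two routing disciplines can be seen by evaluating the product form and the sum form of the latency side by side. The sketch below is a minimal illustration; the function names and the sample values (8 hops, a 150-bit message, 15-bit channels) are assumptions chosen for the example.

```python
def store_and_forward_latency(D, L, W, Tc=1.0):
    """Latency when each node receives the whole message before forwarding it."""
    return Tc * D * (L / W)


def wormhole_latency(D, L, W, Tc=1.0):
    """Latency when flits are forwarded as soon as they arrive."""
    return Tc * (D + L / W)


# Hypothetical example: 8 hops, 150-bit message, 15-bit-wide channels.
D, L, W = 8, 150, 15
print(store_and_forward_latency(D, L, W))  # 8 * 10 = 80 channel cycles
print(wormhole_latency(D, L, W))           # 8 + 10 = 18 channel cycles
```

In this example the distance and the aspect ratio are of similar size, which is where the gap between the product and the sum is most visible.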

2.3  VLSI  Complexity 

VLSI  computing  systems  [14]  are  wire-limited;  the  complexity  of  what  can  be  constructed  is 
limited  by  wire  density,  the  speed  at  which  a  machine  can  run  is  limited  by  wire  delay,  and 
the  majority  of  power  consumed  by  a  machine  is  used  to  drive  wires.  Thus,  machines  must 
be  organized  both  logically  and  physically  to  keep  wires  short  by  exploiting  locality  wherever 
possible.  The  VLSI  architect  must  organize  a  computing  system  so  that  its  form  (physical 
organization)  fits  its  function  (logical  organization). 

Networks have traditionally been analyzed under the assumption of constant channel bandwidth.
Under this assumption each channel is one bit wide ($W = 1$) and has unit delay
(Tc  =  1).  The  constant  bandwidth  assumption  favors  networks  with  high  dimensionality  (e.g., 
binary  n-cubes)  over  low-dimensional  networks  (e.g.,  tori).  This  assumption,  however,  is  not 
consistent  with  the  properties  of  VLSI  technology.  Networks  with  many  dimensions  require 
more  and  longer  wires  than  do  low-dimensional  networks.  Thus,  high-dimensional  networks 
cost more and run more slowly than low-dimensional networks. A realistic comparison of
network topology must take both wire density and wire length into account.

To  account  for  wire  density,  we  will  use  bisection  width  [24]  as  a  measure  of  network  cost.  The 
bisection  width  of  a  network  is  the  minimum  number  of  wires  cut  when  the  network  is  divided 
into  two  equal  halves.  Rather  than  comparing  networks  with  constant  channel  width,  W ,  we 
will  compare  networks  with  constant  bisection  width.  Thus,  we  will  compare  low-dimensional 
networks  with  large  W  with  high-dimensional  networks  with  small  W. 
The delay of a wire depends on its length, $l$. For short wires, the delay, $t_s$, is limited by charging
the capacitance of the wire and varies logarithmically with wire length,

$$t_s = \tau_{inv} \log_e K l, \tag{4}$$

where $\tau_{inv}$ is the inverter delay, and $K$ is a constant depending on capacitance ratios.

For long wires, the delay, $t_l$, is limited by the speed of light,

$$t_l = \frac{l \sqrt{\varepsilon_r}}{c}. \tag{5}$$


In this paper we will consider three delay models: constant delay, $T_c$ independent of length;
logarithmic delay, $T_c \propto \log l$; and linear delay, $T_c \propto l$. Our main result, that latency is minimized
by  low-dimensional  networks,  is  supported  by  all  three  models. 

3  Performance  Analysis 

In this section we compare the performance of unidirectional k-ary n-cube interconnection
networks using the following assumptions:

•  Networks  must  be  embedded  into  the  plane.  If  a  three-dimensional  packaging  technology 
becomes  available,  the  comparison  changes  only  slightly. 
•  Nodes are placed systematically by embedding $n/2$ logical dimensions in each of the two
physical dimensions. We assume that both $n$ and $k$ are even integers. The long
end-around connections shown in Figure 3 can be avoided by folding the network as shown in
Figure 5.

•  For  networks  with  the  same  number  of  nodes,  wire  density  is  held  constant.  Each  network 
is  constructed  with  the  same  bisection  width,  B,  the  total  number  of  wires  crossing  the 
midpoint of the network. To keep the bisection width constant, we vary the width, $W$, of
the communication channels. We normalize to the bisection width of a bit-serial ($W = 1$)
binary n-cube.

•  The  networks  use  wormhole  routing. 

•  Channel  delay,  Tc,  is  a  function  of  wire  length,  /.  We  begin  by  considering  channel  delay 
to  be  constant.  Later,  the  comparison  is  performed  for  both  logarithmic  and  linear  wire 
delays, $T_c \propto \log l$ and $T_c \propto l$.

When $k$ is even, the channels crossing the midpoint of the network are all in the highest
dimension. For each of the $\sqrt{N}$ rows of the network, there are $k^{(n/2 - 1)}$ of these channels in each
direction for a total of $2\sqrt{N} k^{(n/2 - 1)}$ channels. Thus, the bisection width, $B$, of a k-ary n-cube
with $W$-bit wide communication channels is

$$B(k, n) = 2W\sqrt{N} k^{(n/2 - 1)} = \frac{2WN}{k}. \tag{6}$$


For a binary n-cube, $k = 2$, the bisection width is $B(2, n) = WN$. We set $B$ equal to $N$ to
normalize to a binary n-cube with unit width channels, $W = 1$. The channel width, $W(k, n)$,
of a k-ary n-cube with the same bisection width, $B$, follows from (6):

$$W(k, n) = \frac{k}{2}. \tag{7}$$


The peak wire density is greater than the bisection width in networks with $n > 2$ because the
lower dimensions contribute to wire density. The maximum density, however, is bounded by

$$D_{\max} = 2W\sqrt{N} \sum_{i=0}^{n/2 - 1} k^i = k\sqrt{N} \sum_{i=0}^{n/2 - 1} k^i = k\sqrt{N} \left(\frac{k^{n/2} - 1}{k - 1}\right). \tag{8}$$

A  plot  of  wire  density  as  a  function  of  position  for  one  row  of  a  binary  20-cube  is  shown  in 
Figure  6.  The  density  is  very  low  at  the  edges  of  the  cube  and  quite  dense  near  the  center. 
Figure 6: Wire Density vs. Position for One Row of a Binary 20-Cube


The  peak  density  for  the  row  is  1364  at  position  341.  Compare  this  density  with  the  bisection 
width  of  the  row,  which  is  1024.  In  contrast,  a  two-dimensional  torus  has  a  wire  density  of 
1024  independent  of  position.  One  advantage  of  high-radix  networks  is  that  they  have  a  very 
uniform  wire  density.  They  make  full  use  of  available  area. 
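
The row density figures quoted above can be reproduced with a small counting sketch. It rests on assumptions that are not spelled out in the text: the 1024 nodes of one row are placed in natural binary index order, each of the 10 logical dimensions in the row links nodes whose indices differ in one bit, and every link contributes two bit-serial wires (one per direction). The function name `row_wire_density` is illustrative.

```python
def row_wire_density(dims):
    """Wires crossing each cut of a row holding 2**dims nodes in index order.

    density[p] counts the wires crossing the cut between node p - 1 and node p.
    """
    size = 1 << dims
    diff = [0] * (size + 2)  # difference array over cut positions
    for u in range(size):
        for i in range(dims):
            v = u ^ (1 << i)  # neighbor of u in logical dimension i
            if v > u:
                # Two channels (one per direction) cross cuts u + 1 .. v.
                diff[u + 1] += 2
                diff[v + 1] -= 2
    density, running = [], 0
    for p in range(size + 1):
        running += diff[p]
        density.append(running)
    return density


density = row_wire_density(10)  # a binary 20-cube places 10 dimensions per row
peak = max(density)
print(peak, density.index(peak), density[512])  # prints: 1364 341 1024
```

Under these assumptions the peak of 1364 wires appears at position 341, and the cut at the midpoint of the row carries 1024 wires, matching the values in the text.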

Each processing node connects to $2n$ channels ($n$ input and $n$ output), each of which is $k/2$ bits
wide. Thus, the number of pins per processing node is

$$N_p = nk. \tag{9}$$

A plot of pin density as a function of dimension for N = 256, 16K and 1M nodes³ is shown
in  Figure  7.  Low-dimensional  networks  have  the  disadvantage  of  requiring  many  pins  per 
processing  node.  A  two-dimensional  network  with  1M  nodes  (not  shown)  requires  2048  pins 
and  is  clearly  unrealizable.  However,  the  number  of  pins  decreases  very  rapidly  as  the  dimension, 
n,  increases.  Even  for  1M  nodes,  a  dimension  4  node  has  only  128  pins.  All  of  the  configurations 
that  give  low  latency  also  give  a  reasonable  pin  count. 
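
The pin counts quoted above follow directly from the channel width and pin formulas. The sketch below evaluates them under the normalization B = N; the helper names are assumptions for illustration, and non-integer radices are allowed as in the rest of the analysis.

```python
def radix(num_nodes, n):
    """Radix k of a k-ary n-cube with the given number of nodes (k = N**(1/n))."""
    return num_nodes ** (1.0 / n)


def bisection_width(k, n, W):
    """Equation (6): B = 2 W N / k, with N = k**n."""
    return 2 * W * k ** n / k


def channel_width(k):
    """Equation (7): channel width when B is normalized to N."""
    return k / 2


def pins_per_node(k, n):
    """Equation (9): 2n channels, each k/2 bits wide."""
    return 2 * n * channel_width(k)


N = 2 ** 20  # 1M nodes
for n in (2, 4, 5, 10, 20):
    k = radix(N, n)
    print(n, round(k, 2), round(channel_width(k), 2), round(pins_per_node(k, n), 1))
```

For 1M nodes this gives about 2048 pins at dimension 2 and about 128 pins at dimension 4, in line with the figures in the text.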


3.1  Latency 


Latency, $T_l$, is the sum of the latency due to the network and the latency due to the processing
node,

$$T_l = T_{net} + T_{node}. \tag{10}$$

3 1K = 1024 and 1M = 1K × 1K = 1048576.

Figure 7: Pin Density vs. Dimension for 256, 16K, and 1M Nodes

In this paper we are concerned only with $T_{net}$. Techniques to reduce $T_{node}$ are described in [5]
and [9].

If we select two processing nodes, $P_i$, $P_j$, at random, the average number of channels that must
be traversed to send a message from $P_i$ to $P_j$ is given by

$$D = \left(\frac{k - 1}{2}\right) n. \tag{11}$$

The average latency of a k-ary n-cube is calculated by substituting (7) and (11) into (3),

$$T_{net} = T_c \left(\left(\frac{k - 1}{2}\right) n + \frac{2L}{k}\right). \tag{12}$$
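
The average distance in (11) can be checked by brute force on a small unidirectional cube. This is only a sanity check under the assumption that source and destination are drawn uniformly and independently (so a node may be paired with itself); the function name is illustrative.

```python
from itertools import product


def average_distance(k, n):
    """Average hop count over all ordered (source, destination) pairs."""
    nodes = list(product(range(k), repeat=n))
    total = 0
    for src in nodes:
        for dst in nodes:
            # In a unidirectional ring the hop count in one dimension is (b - a) mod k.
            total += sum((b - a) % k for a, b in zip(src, dst))
    return total / len(nodes) ** 2


print(average_distance(4, 3), (4 - 1) / 2 * 3)  # both are 4.5
```

Each dimension contributes an offset that is uniform on 0 to k - 1, with mean (k - 1)/2, and the n dimensions add.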

Figure 8 shows the average network latency, $T_{net}$, as a function of dimension, $n$, for k-ary
n-cubes with $2^8$ (256), $2^{14}$ (16K), and $2^{20}$ (1M) nodes⁴. The left most data point in this
figure  corresponds  to  a  torus  (n  =  2)  and  the  right  most  data  point  corresponds  to  a  binary 
n-cube  (k  =  2).  This  figure  assumes  constant  wire  delay,  Tc,  and  a  message  length,  L,  of 
150  bits.  This  choice  of  message  length  was  based  on  the  analysis  of  a  number  of  fine-grain 
concurrent  programs  [5].  Although  constant  wire  delay  is  unrealistic,  this  figure  illustrates  that 
even  ignoring  the  dependence  of  wire  delay  on  wire  length,  low-dimensional  networks  achieve 
lower  latency  than  high-dimensional  networks. 

4 For the sake of comparison we allow radix to take on non-integer values. For some of the dimensions
considered, there is no integer radix, $k$, that gives the correct number of nodes. In fact, this limitation can be
overcome by constructing a mixed-radix cube.
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Figure  8:  Latency  vs.  Dimension  for  256,  16K,  and  1M  Nodes,  Constant  Delay 

The  latency  of  the  tori  on  the  left  side  of  Figure  8  is  limited  almost  entirely  by  distance.  The 
latency of the binary n-cubes on the right side of the graph is limited almost entirely by aspect
ratio.  With  bit  serial  channels,  these  cubes  take  150  cycles  to  transmit  their  messages  across  a 
single  channel. 

In  an  application  that  exploits  locality  of  communication,  the  distance  between  communicating 
objects  is  reduced.  In  such  a  situation,  the  latency  of  the  low-dimensional  networks  (the  left  side 
of  Figure  8)  is  reduced.  High-dimensional  networks,  on  the  other  hand,  cannot  take  advantage 
of  locality.  Their  latency  will  remain  high. 

In  applications  that  send  short  messages,  the  component  of  latency  due  to  message  length  is 
reduced  resulting  in  lower  latency  for  high-dimensional  networks  (the  right  side  of  Figure  8). 

In general the lowest latency is achieved when the component of latency due to distance, $D$,
and the component due to message length, $L/W$, are approximately equal, $D \approx L/W$. For the three
cases shown in Figure 8, minimum latencies are achieved for $n$ = 2, 4, and 5 respectively.
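
The constant-delay comparison can be reproduced by sweeping the dimension and evaluating the latency expression for each network size. This is a sketch under stated assumptions: $T_c = 1$, $L = 150$ bits, integer dimensions from 2 up to the binary n-cube, and non-integer radix allowed. The function names are illustrative.

```python
def network_latency(num_nodes, n, L=150, Tc=1.0):
    """Constant-delay latency: Tc * (((k - 1) / 2) * n + 2L / k), with k = N**(1/n)."""
    k = num_nodes ** (1.0 / n)
    return Tc * ((k - 1) / 2 * n + 2 * L / k)


def best_dimension(num_nodes, L=150):
    """Integer dimension with the lowest latency, from a torus up to a binary n-cube."""
    max_n = num_nodes.bit_length() - 1  # k = 2 when n = log2(N)
    return min(range(2, max_n + 1), key=lambda n: network_latency(num_nodes, n, L))


for N in (2 ** 8, 2 ** 14, 2 ** 20):
    n = best_dimension(N)
    print(N, n, round(network_latency(N, n), 1))
```

With these assumptions the sweep selects dimensions 2, 4, and 5 for 256, 16K, and 1M nodes, the same minima reported for Figure 8.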

The  longest  wire  in  the  system  becomes  a  bottleneck  that  determines  the  rate  at  which  each 
channel operates, $T_c$. The length of this wire is given by

$$l = k^{n/2 - 1}. \tag{13}$$

If the wires are sufficiently short, delay depends logarithmically on wire length. If the channels
are longer, they become limited by the speed of light, and delay depends linearly on channel
length. Substituting (13) into (4) and (5) gives

$$T_c \propto \begin{cases} 1 + \log_e l = 1 + \left(\frac{n}{2} - 1\right) \log_e k & \text{logarithmic delay} \\ l = k^{n/2 - 1} & \text{linear delay.} \end{cases} \tag{14}$$

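We substitute (14) into (12) to get the network latency for these two cases:

$$T_{net} \propto \begin{cases} \left(1 + \left(\frac{n}{2} - 1\right) \log_e k\right) \left(\left(\frac{k - 1}{2}\right) n + \frac{2L}{k}\right) & \text{logarithmic delay} \\ \left(k^{n/2 - 1}\right) \left(\left(\frac{k - 1}{2}\right) n + \frac{2L}{k}\right) & \text{linear delay.} \end{cases} \tag{15}$$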

Figure 9 shows the average network latency as a function of dimension for k-ary n-cubes with
$2^8$ (256), $2^{14}$ (16K), and $2^{20}$ (1M) nodes, assuming logarithmic wire delay and a message length,
L,  of  150.  Figure  10  shows  the  same  data  assuming  linear  wire  delays.  In  both  figures,  the  left 
most  data  point  corresponds  to  a  torus  (n  =  2)  and  the  right  most  data  point  corresponds  to 
a  binary  n-cube  (k  =  2). 

In the linear delay case, Figure 10, a torus ($n = 2$) always gives the lowest latency. This
is  because  a  torus  offers  the  highest  bandwidth  channels  and  the  most  direct  physical  route 
between  two  processing  nodes.  Under  the  linear  delay  assumption,  latency  is  determined  solely 
by  bandwidth  and  by  the  physical  distance  traversed.  There  is  no  advantage  in  having  long 
channels. 

Under  the  logarithmic  delay  assumption,  Figure  9,  a  torus  has  the  lowest  latency  for  small 
networks  (N  =  256).  For  the  larger  networks,  the  lowest  latency  is  achieved  with  slightly  higher 
dimensions. With $N$ = 16K, the lowest latency occurs when $n$ is three⁵. With $N$ = 1M, the
lowest latency is achieved when $n$ is 5. It is interesting that assuming constant wire delay does
not  change  this  result  much.  Recall  that  under  the  (unrealistic)  constant  wire  delay  assumption, 
Figure  8,  the  minimum  latencies  are  achieved  with  dimensions  of  2,  4,  and  5  respectively. 
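
The effect of the delay model on the best dimension can be explored with the sketch below. It assumes the longest wire has length $k^{n/2 - 1}$, that the logarithmic model uses the natural logarithm with unit proportionality constant, that $L = 150$, and that the dimension is swept over integers with non-integer radix allowed. The function names and the model labels are illustrative.

```python
import math


def channel_delay(k, n, model):
    """Relative channel cycle time for the longest wire under a given delay model."""
    wire_length = k ** (n / 2 - 1)
    if model == "constant":
        return 1.0
    if model == "logarithmic":
        return 1.0 + math.log(wire_length)
    if model == "linear":
        return wire_length
    raise ValueError(f"unknown delay model: {model}")


def network_latency(num_nodes, n, model, L=150):
    """Latency up to a constant factor: delay * (((k - 1) / 2) * n + 2L / k)."""
    k = num_nodes ** (1.0 / n)
    return channel_delay(k, n, model) * ((k - 1) / 2 * n + 2 * L / k)


def best_dimension(num_nodes, model, L=150):
    """Integer dimension with the lowest latency, from a torus up to a binary n-cube."""
    max_n = num_nodes.bit_length() - 1
    return min(range(2, max_n + 1), key=lambda n: network_latency(num_nodes, n, model, L))


for model in ("constant", "logarithmic", "linear"):
    print(model, [best_dimension(N, model) for N in (2 ** 8, 2 ** 14, 2 ** 20)])
```

Under these assumptions the sweep gives dimensions 2, 4, 5 for constant delay, 2, 3, 5 for logarithmic delay, and 2 in every case for linear delay, consistent with the discussion of Figures 8 through 10.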

The results shown in Figures 8 through 10 were derived by comparing networks under the
assumption of constant wire cost to a binary n-cube with $W = 1$. For small networks it is
possible to construct binary n-cubes with wider channels, and for large networks (e.g., 1M
nodes) it may not be possible to construct a binary n-cube at all. The available wiring area
grows as $N^{2/3}$ while the bisection width of a binary n-cube grows as $N$. In the case of small
networks, the comparison against binary n-cubes with wide channels can be performed by
expressing message length in terms of the binary n-cube's channel width, in effect decreasing
the message length for purposes of comparison. The net result is the same: lower-dimensional
networks give lower latency. Even if we perform the 256 node comparison against a binary
n-cube with $W = 16$, the torus gives the lowest latency under the logarithmic delay model,
and a dimension 3 network gives minimum latency under the constant delay model. For large
networks, the available wire is less than assumed, so the effective message length should be
increased, making low-dimensional networks look even more favorable.

5 In an actual machine the dimension $n$ would be restricted to be an even integer.

In  this  comparison  we  have  assumed  that  only  a  single  bit  of  information  is  in  transit  on  each 
wire  of  the  network  at  a  given  time.  Under  this  assumption,  the  delay  between  nodes,  Tc,  is 
equal  to  the  period  of  each  node,  Tp.  In  a  network  with  long  wires,  however,  it  is  possible  to 
have several bits in transit at once. In this case, the channel delay, $T_c$, is a function of wire
length, while the channel period, $T_p < T_c$, remains constant. Similarly, in a network with very
short wires we may allow a bit to ripple through several channels before sending the next bit.
In this case, $T_p > T_c$. Separating the coefficients, $T_c$ and $T_p$, (3) becomes

$$T_{net} = T_c D + T_p \frac{L}{W}. \tag{16}$$

The net effect of allowing $T_c \neq T_p$ is the same as changing the length, $L$, by a factor of $T_p / T_c$ and
does not change our results significantly.

When  wire  cost  is  considered,  low-dimensional  networks  (e.g.,  tori)  offer  lower  latency  than 
high-dimensional  networks  (e.g.,  binary  n-cubes).  Intuitively,  tori  outperform  binary  n-cubes 
because  they  better  match  form  to  function.  The  logical  and  physical  graphs  of  the  torus  are 
identical; thus, messages always travel the minimum distance from source to destination. In a
binary  n-cube,  on  the  other  hand,  the  fit  between  form  and  function  is  not  as  good.  A  message 
in  a  binary  n-cube  embedded  into  the  plane  may  have  to  traverse  considerably  more  than  the 
minimum  distance  between  its  source  and  destination. 


3.2  Throughput 


Throughput,  another  important  metric  of  network  performance,  is  defined  as  the  total  number 
of  messages  the  network  can  handle  per  unit  time.  One  method  of  estimating  throughput  is  to 
calculate  the  capacity  of  a  network,  the  total  number  of  messages  that  can  be  in  the  network 
at  once.  Typically  the  maximum  throughput  of  a  network  is  some  fraction  of  its  capacity.  The 
network capacity per node is the total bandwidth out of each node divided by the average
number of channels traversed by each message. For k-ary n-cubes, the bandwidth out of each
node is $nW$, and the average number of channels traversed is given by (11), so the network
capacity per node is given by

$$\frac{nW}{D} = \frac{n \left(\frac{k}{2}\right)}{\left(\frac{k - 1}{2}\right) n} \approx 1. \tag{17}$$


The  network  capacity  is  independent  of  dimension.  For  a  constant  wire  density,  there  is  a 
constant  network  capacity. 
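
The capacity estimate can be evaluated numerically by dividing the bandwidth out of a node by the average distance. The sketch assumes the normalized channel width $W = k/2$ and the average distance $D = ((k - 1)/2) n$; the function name is illustrative.

```python
def capacity_per_node(k, n):
    """Bandwidth out of a node (n * W) divided by the average distance D."""
    W = k / 2            # channel width under constant bisection
    D = (k - 1) / 2 * n  # average number of channels traversed
    return n * W / D


for k, n in ((1024, 2), (32, 4), (16, 5), (4, 10), (2, 20)):
    print(k, n, round(capacity_per_node(k, n), 3))
```

The dimension cancels and the ratio reduces to k / (k - 1), so it is close to 1 for the high-radix networks and rises to 2 for the binary n-cube.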


Throughput  will  be  less  than  capacity  because  contention  causes  some  channels  to  block.  This 
contention  also  increases  network  latency.  To  simplify  the  analysis  of  this  contention,  we  make 
the  following  assumptions: 
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Figure  11:  Contention  Model  for  A  Network 

•  Messages are routed using e-cube routing (in order of decreasing dimension) [6]. That is,
a message at node $a_0, \ldots, a_{n-1}$ destined for node $b_0, \ldots, b_{n-1}$ is first routed in dimension
$n - 1$ until it reaches node $a_0, \ldots, a_{n-2}, b_{n-1}$. The message is then routed in dimension
$n - 2$ until it reaches node $a_0, \ldots, a_{n-3}, b_{n-2}, b_{n-1}$, and so on. As shown in Figure 11,
this assumption allows us to consider the contention in each dimension separately.

•  The traffic from each node is generated by a Poisson process with arrival rate $\lambda$ (bits/cycle).

•  Message  destinations  are  uniformly  distributed  and  independent. 
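
The e-cube routing assumption in the list above can be sketched as follows. The function name `ecube_route` and the list-of-digits address representation (index i holds digit $a_i$) are assumptions made for illustration; hops follow the unidirectional channels, so each step increments one digit modulo k.

```python
def ecube_route(src, dst, k):
    """Addresses visited when routing from src to dst in order of decreasing dimension."""
    current = list(src)
    path = [tuple(current)]
    for i in reversed(range(len(src))):  # dimension n - 1 first, dimension 0 last
        while current[i] != dst[i]:
            current[i] = (current[i] + 1) % k  # one hop along dimension i
            path.append(tuple(current))
    return path


# Example in a 4-ary 2-cube: correct digit a_1 first, then digit a_0.
path = ecube_route([1, 0], [0, 2], k=4)
print(path, len(path) - 1)
```

Because a message finishes with one dimension before starting the next, the traffic it contributes to each dimension can be analyzed separately, which is the property the contention model relies on.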

The arrival rate of $\lambda$ (bits/cycle) corresponds to $\lambda_E = \lambda / L$ (messages/cycle). At the destination,
each flit is serviced as soon as it arrives, so the service time at the sink is
$T_0 = \frac{L}{W} = \frac{2L}{k}$. Starting with $T_0$ we will
calculate the service time seen entering each preceding dimension.

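For convenience, we will define the following quantities:

$$\begin{aligned}
\gamma &= \frac{1}{k}, \\
\lambda_S &= \gamma \lambda_E, \\
\lambda_R &= (1 - \gamma) \lambda_E, \\
\lambda_{SS} &= \gamma^2 \lambda_E, \\
\lambda_{SR} &= \gamma (1 - \gamma) \lambda_E, \\
\lambda_{RS} &= \gamma (1 - \gamma) \lambda_E, \text{ and} \\
\lambda_{RR} &= (1 - \gamma)^2 \lambda_E.
\end{aligned} \tag{18}$$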

Consider a single dimension, $i$, of the network as shown in Figure 12. All messages incur
a latency, $T_E$, due to contention on entering the dimension. Those messages that are routed
incur an additional latency, $T_R$, due to contention during routing. The rate $\lambda_E$ message stream
entering the dimension is composed of two components: a rate $\lambda_S$ stream that skipped the
previous ($i + 1$st) dimension, and a rate $\lambda_R$ stream that was routed in the previous dimension.
These two streams are in turn split into components that will skip the $i$th dimension ($\lambda_{SS}$ and
$\lambda_{RS}$) and components that will be routed in the $i$th dimension ($\lambda_{SR}$ and $\lambda_{RR}$). The entering
latency seen by one component (say $\lambda_{RR}$) is given by multiplying the probability of a collision
(in this case $\lambda_{SR}(T_i + T_R)$) by the expected latency due to a collision (in this case
$(T_i + T_R)/2$). The components that require routing must also add the latency due to contention
during routing, $T_R$. Adding up the four components with appropriate weights gives the following
equation for $T_{i+1}$.


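$$T_{i+1} = T_i + (1-\gamma)T_R + \gamma(1-\gamma)^3 \lambda_E (T_i + T_R)^2 + \gamma^3(1-\gamma)\lambda_E T_i^2. \qquad (19)$$

For large $k$, $\gamma$ is small and the latency is approximated by $T_{i+1} \approx T_i + T_R$. For $k = 2$
(binary n-cubes), $T_R = 0$; thus, $T_{i+1} = T_i + \lambda_E T_i^2 / 8$.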

To calculate the routing latency, $T_R$, we use the model shown in Figure 13. Given that a message
is to be routed in a dimension, the expected number of channels traversed by the message is
$k/2$, one entering channel and $\sigma = (k-2)/2$ continuing channels. Thus, the average message rate on
channels continuing in the dimension is $\lambda_C = \sigma\lambda_R$. Using virtual channels and e-cube routing,
the  actual  continuting  rate  on  the  7th  channel  (outer  spiral)  is  A cj  —  ( j  -  lsj|l)A/|.  To  calculate 
$T_R$ we need only the average rate.

The service time in the last continuing channel in dimension $i$ is $T_{i\sigma} = T_i$. Once we know
the service time for the $j$th channel, $T_{ij}$, the additional service time due to contention at the
$(j-1)$st channel is given by multiplying the probability of a collision, $\lambda_R T_{i0}$, by the expected
waiting time for a collision, $T_{i0}/2$. Repeating this calculation $\sigma$ times gives us $T_{i0}$.

Figure 13: Contention Model for Routing Latency


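$$T_{i(j-1)} = T_{ij} + \frac{\lambda_R T_{i0}^2}{2},$$

$$T_{i0} = T_i + \frac{\sigma \lambda_R T_{i0}^2}{2} = T_i + \frac{\lambda_C T_{i0}^2}{2} = \frac{1 - \sqrt{1 - 2\lambda_C T_i}}{\lambda_C}. \qquad (20)$$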

Equation (20) is valid only when $\lambda_C < \frac{1}{2T_i}$. If the message rate is higher than this limit, there
is  no  steady-state  solution  and  latency  becomes  infinite.  There  are  two  solutions  to  (20).  Here 
we  consider  only  the  smaller  of  the  two  latencies.  The  larger  solution  corresponds  to  a  state 
that  is  not  encountered  during  normal  operation  of  a  network. 

To calculate $T_R$ we also need to consider the possibility of a collision on the entering channel.

$$T_R = T_{i0}\left(1 + \frac{\lambda_C T_{i0}}{2}\right) - T_i. \qquad (21)$$

If sufficient queueing is added to each network node, the service times do not increase, only the
latency does, and equations (21) and (19) become:

r»=(n^r)  (1+¥0-r"  <22> 

$$T_{i+1} = T_i + (1-\gamma)T_R + \left(\gamma(1-\gamma)^3 + \gamma^3(1-\gamma)\right)\lambda_E T_0^2. \qquad (23)$$

To be effective, the total queueing between the source and destination should be greater than
the expected increase in latency due to blocking. One or two flits of queueing per stage is
usually sufficient. The analysis here is pessimistic in that it assumes no queueing.

Figure 14: Latency vs. Traffic ($\lambda$) for 1K node networks: 32-ary 2-cube, 4-ary 5-cube, and
binary 10-cube, L=200 bits. Solid line is predicted latency, points are measurements taken from
a simulator.


Using equation (19), we can determine (1) the maximum throughput of the network and (2)
how network latency increases with traffic.


Figures 14 and 15 show how latency increases as a function of applied traffic for 1K node and 4K
node k-ary n-cubes. The vertical axis shows latency in cycles. The horizontal axis is traffic per
node, $\lambda$. The figures compare measurements from a network simulator (points) to
the latency predicted by (21) (lines). The simulation agrees with the prediction within a few
percent until the network approaches saturation.


For 1K networks, a 32-ary 2-cube always gives the lowest latency. For 4K networks, a 16-ary
3-cube gives the lowest latency when $\lambda < 0.2$. Because latency increases more slowly for
2-dimensional networks, a 64-ary 2-cube gives the lowest latency when $\lambda > 0.2$.

At the left side of each graph ($\lambda = 0$), latency is given by (12). As traffic is applied to the
network, latency increases slowly due to contention in the network until saturation is reached.
Saturation occurs when $\lambda$ is between 0.3 and 0.5 depending on the network topology. Networks
should be designed to operate on the flat portion of the curve ($\lambda < 0.25$).

Figure 15: Latency vs. Traffic ($\lambda$) for 4K node networks: 64-ary 2-cube, 16-ary 3-cube, 8-ary
4-cube, 4-ary 6-cube, and binary 12-cube, L=200 bits. Solid line is predicted latency, points are
measurements taken from a simulator.


Figure 16: Actual Traffic vs. Attempted Traffic for 1K node networks: 32-ary 2-cube, 4-ary
5-cube, and binary 10-cube, L=200 bits.


Figure 17: Actual Traffic vs. Attempted Traffic for 4K node networks: 64-ary 2-cube, 16-ary
3-cube, 8-ary 4-cube, 4-ary 6-cube, and binary 12-cube, L=200 bits.


Parameter                     1K Nodes                        4K Nodes
                         --------------------   ----------------------------------
Dimension                    2      5     10       2      3      4      6     12
Radix                       32      4      2      64     16      8      4      2
Max Throughput            0.36   0.41   0.43    0.35   0.31   0.31   0.36   0.41
Latency $\lambda = 0.1$  (illegible)  128.  233.    70.7   55.2   79.9   135.   241.
Latency $\lambda = 0.2$   50.5   161.   269.    73.1   70.3   112.   181.   288.
Latency $\lambda = 0.3$   59.3   221.   317.    78.6   135.   245.   287.   357.

Table 1: Maximum Throughput as a Fraction of Capacity and Blocking Latency in Cycles

When  the  network  saturates,  throughput  levels  off  as  shown  in  Figures  16  and  17.  These  figures 
show  how  much  traffic  is  delivered  (vertical  axis)  when  the  nodes  attempt  to  inject  a  given 
amount  of  traffic  (horizontal  axis).  The  curve  is  linear  (actual  =  attempted)  until  saturation 
is  reached.  From  this  point  on,  actual  traffic  is  constant.  This  plateau  occurs  because  (1)  the 
network  is  source  queued,  and  (2)  messages  that  encounter  contention  are  blocked  rather  than 
aborted.  In  networks  where  contention  is  resolved  by  dropping  messages,  throughput  usually 
decreases  beyond  saturation. 

To find the maximum throughput of the network, the source service time, $T_0$, is set equal to
the reciprocal of the message rate, $\lambda_E$, and equations (19), (20), and (21) are solved for $\lambda_E$.
The maximum throughput as a fraction of capacity for k-ary n-cubes with 1K and 4K nodes is
tabulated in Table 1. Also shown is the total latency for L = 200 bit messages at several message
rates. The table shows that the additional latency due to blocking is significantly reduced as
dimension is decreased.
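
The recurrence in equations (19), (20), and (21) can be evaluated numerically. The following is a minimal sketch rather than the author's simulator or solver. It assumes $W = k/2$, a channel delay of one cycle, an average distance of $D = n(k-1)/2$ hops, and $\lambda_E = \lambda/L$. The function name `blocked_latency` and its parameters are illustrative and do not appear in the original.

```python
import math

def blocked_latency(k, n, lam, L=200.0):
    """Predicted latency in cycles of a k-ary n-cube at per-node traffic lam.

    Assumptions (not stated as code in the source): W = k/2, T_c = 1 cycle,
    D = n(k-1)/2 hops, and message rate lam_e = lam / L.
    """
    gamma = 1.0 / k                   # probability that a message skips a dimension
    lam_e = lam / L                   # message rate entering each dimension
    lam_r = (1.0 - gamma) * lam_e     # rate of messages routed in a dimension
    sigma = (k - 2) / 2.0             # average number of continuing channels
    lam_c = sigma * lam_r             # average rate on continuing channels
    t = 2.0 * L / k                   # T_0 = L/W, service time at the sink
    for _ in range(n):
        if lam_c > 0.0:
            disc = 1.0 - 2.0 * lam_c * t
            if disc < 0.0:
                return math.inf       # past the validity limit of (20)
            t_i0 = (1.0 - math.sqrt(disc)) / lam_c       # equation (20)
            t_r = t_i0 * (1.0 + lam_c * t_i0 / 2.0) - t  # equation (21)
        else:
            t_r = 0.0                 # no continuing traffic (k = 2 or zero load)
        # equation (19): service time seen entering the next dimension
        t = (t + (1.0 - gamma) * t_r
             + gamma * (1.0 - gamma) ** 3 * lam_e * (t + t_r) ** 2
             + gamma ** 3 * (1.0 - gamma) * lam_e * t ** 2)
    return n * (k - 1) / 2.0 + t      # distance term plus source service time
```

Under these assumptions, `blocked_latency(2, 10, 0.2)` evaluates to roughly 269 cycles and `blocked_latency(32, 2, 0.2)` to roughly 50.5 cycles, in line with the corresponding Table 1 entries.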

In networks of constant bisection width, the latency of low-dimensional networks increases more
slowly with applied traffic than the latency of high-dimensional networks. At $\lambda = 0.2$, the
32-ary 2-cube has $\approx \frac{1}{5}$ the latency of the binary 10-cube. At this point, the additional latency
due to contention in the 32-ary 2-cube is $7T_c$ compared to $64T_c$ in the binary 10-cube. At
moderate loads, low-dimensional networks may outperform higher-dimensional networks with
lower zero-load latency. For example, a 16-ary 3-cube has lower zero-load latency than a 64-ary
2-cube (47.5 vs. 69.25). However, the 64-ary 2-cube has lower latency when $\lambda = 0.3$ (78.6 vs.
135).
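
The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes a zero-load latency of $D + L/W$ cycles with $D = n(k-1)/2$, $W = k/2$, and $L = 200$ bits. The helper names are illustrative only.

```python
def zero_load_latency(k, n, L=200.0):
    """Zero-load latency in cycles, assuming D = n(k-1)/2 and W = k/2."""
    distance = n * (k - 1) / 2.0   # average number of hops
    width = k / 2.0                # channel width under constant bisection
    return distance + L / width

def contention_overhead(k, n, total_latency, L=200.0):
    """Latency added by contention: a tabulated total minus the zero-load value."""
    return total_latency - zero_load_latency(k, n, L)

print(zero_load_latency(16, 3))            # 47.5
print(zero_load_latency(64, 2))            # 69.25
# Table 1 totals at a traffic level of 0.2:
print(contention_overhead(32, 2, 50.5))    # 7.0
print(contention_overhead(2, 10, 269.0))   # 64.0
```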

Intuitively, low-dimensional networks handle contention better because they use fewer channels
of higher bandwidth and thus get better queueing performance. The shorter service times, $L/W$,
of these networks result in both a lower probability of collision and a lower expected waiting
time in the event of a collision. Thus the blocking latency at each node is reduced quadratically
as $k$ is increased. Low-dimensional networks require more hops, $D = n(k-1)/2$, and have a higher
rate on the continuing channels, $\lambda_C$. However, messages travel on the continuing channels more
frequently than on the entering channels, thus most contention is with the lower rate channels.
Having fewer channels of higher bandwidth also improves hot-spot throughput as described
below.

3.3  Hot Spot Throughput


In  many  situations  traffic  is  not  uniform,  but  rather  is  concentrated  into  hot  spots.  A  hot  spot 
is a pair of nodes that accounts for a disproportionately large portion of the total network
traffic.  As  described  by  Pfister  [16]  for  a  shared-memory  computer,  hot-spot  traffic  can  degrade 
performance  of  the  entire  network  by  causing  congestion. 

The hot-spot throughput of a network is the maximum rate at which messages can be sent
from one specific node, $P_i$, to another specific node, $P_j$. For a k-ary n-cube with deterministic
routing, the hot-spot throughput, $\Theta_{HS}$, is just the bandwidth of a single channel, $W$. Thus,
under the assumption of constant wire cost we have

$$\Theta_{HS} = W = k - 1. \qquad (24)$$

Low-dimensional  networks  have  greater  channel  bandwidth  and  thus  have  greater  hot-spot 
throughput  than  do  high-dimensional  networks.  Intuitively,  low-dimensional  networks  operate 
better  under  non-uniform  loads  because  they  do  more  resource  sharing.  In  an  interconnection 
network  the  resources  are  wires.  In  a  high-dimensional  network,  wires  are  assigned  to  particular 
dimensions  and  cannot  be  shared  between  dimensions.  For  example,  in  a  binary  n-cube  it  is 
possible  for  a  wire  to  be  saturated  while  a  physically  adjacent  wire  assigned  to  a  different 
dimension  remains  idle.  In  a  torus  all  physically  adjacent  wires  are  combined  into  a  single 
channel  that  is  shared  by  all  messages  that  must  traverse  the  physical  distance  spanned  by  the 
channel. 


4  Conclusion 


Under  the  assumption  of  constant  wire  bisection,  low-dimensional  networks  with  wide  channels 
provide  lower  latency,  less  contention,  and  higher  hot-spot  throughput  than  high-dimensional 
networks  with  narrow  channels.  Minimum  network  latency  is  achieved  when  the  network  radix, 
k, and dimension, n, are chosen to make the components of latency due to distance, $D$, and
aspect ratio, $L/W$, approximately equal. The minimum latency occurs at a very low dimension, 2
for  up  to  1024  nodes. 

Low-dimensional networks reduce contention because having a few high-bandwidth channels
results  in  more  resource  sharing  and  thus  better  queueing  performance  than  having  many  low- 
bandwidth  channels.  While  network  capacity  and  worst-case  blocking  latency  are  independent 
of  dimension,  low-dimensional  networks  have  a  higher  maximum  throughput  and  lower  average 
blocking  latency  than  do  high-dimensional  networks.  Improved  resource  sharing  also  gives 
low-dimensional  networks  higher  hot-spot  throughput  than  high-dimensional  networks. 

The  results  of  this  paper  have  all  been  made  under  the  assumption  of  constant  channel  delay, 
independent  of  channel  length.  The  main  result,  that  low-dimensional  networks  give  minimum 
latency,  however,  does  not  change  appreciably  when  logarithmic  or  linear  delay  models  are 
considered.  In  choosing  a  delay  model  one  must  consider  how  the  delay  of  a  switching  node 
compares to the delay of a wire. Current VLSI routing chips [7] have delays of tens of nanoseconds,
enough time to drive several meters of wire. For such systems a constant delay model is
adequate.  As  chips  get  faster  and  systems  get  larger,  however,  a  linear  delay  model  will  more 
accurately  reflect  system  performance. 

Fat-tree networks have been shown to be universal in the sense that they can efficiently simulate
any other network of the same volume [13]. However, the analysis of these networks has not
considered latency. k-ary n-cubes with appropriately chosen radix and dimension are also
universal in this sense. A detailed proof is beyond the scope of this paper. Intuitively, one
cannot do any better than to fill each of the three physical dimensions with wires and place
switches at every point of intersection. Any point-to-point network can be embedded into such
a 3-D mesh with no more than a constant increase in wiring length.

This paper has considered only direct networks [17]. The results do not apply to indirect
networks. The depth and the switch degree of an indirect network are analogous to the dimension
and radix of a direct network. However, the bisection width of an indirect network is independent
of switch degree. Because indirect networks do not exploit locality it is not possible to
trade off diameter for bandwidth. There is little reason to construct an indirect network. A
high-bandwidth direct network would provide the same function with increased performance.

Low-dimensional k-ary n-cubes provide a very general communication medium for digital systems.
These networks have been developed primarily for message-passing concurrent computers.
They  could  also  be  used  in  place  of  a  bus  or  indirect  network  in  a  shared-memory  concurrent 
computer,  in  place  of  a  bus  to  connect  the  components  of  a  sequential  computer,  or  to  connect 
subsystems  of  a  special  purpose  digital  system.  With  VLSI  communication  chips  the  cost  of 
implementing  a  network  node  is  comparable  to  the  cost  of  interfacing  to  a  shared  bus,  and  the 
performance  of  the  network  is  considerably  greater  than  the  performance  of  a  bus. 

The Torus Routing Chip (TRC) is a VLSI chip designed to implement low-dimensional k-ary
n-cube interconnection networks [7]. The TRC performs wormhole routing in arbitrary
k-ary n-cube interconnection networks. This self-timed chip was functional on first silicon. A
single TRC provides 8-bit data channels in two dimensions and can be cascaded to add more
dimensions or wider data channels. A TRC network can deliver a 150-bit message in a 1024
node 32-ary 2-cube with an average latency of 7.5 μs, an order of magnitude better performance
than would be achieved by a binary n-cube with bit-serial channels. A new routing chip, the
Network Design Frame (NDF), currently under development, is expected to improve this latency
to ≈ 1 μs.

Now that the latency of communication networks has been reduced to a few microseconds, the
latency of the processing nodes, $T_{node}$, dominates the overall latency. To efficiently make use of
a  low-latency  communication  network  we  need  a  processing  node  that  interprets  messages  with 
very  little  overhead.  The  design  of  such  a  message-driven  processor  is  currently  underway  [5] 
[9]. 

The real challenge in concurrent computing is software. The development of concurrent software
is strongly influenced by available concurrent hardware. We hope that by providing
machines with higher performance internode communication we will encourage concurrency to
be exploited at a finer grain size in both system and application software.

Abstract

One  of  the  main  difficulties  in  designing  algorithms  for  large  scale  parallel  machines  is 
making  sure  that  the  capacities  of  the  local  memories  are  not  exceeded.  In  this  paper,  we 
present a general scheme for dynamically reorganizing memory so that local memory
constraints are never exceeded provided that global memory constraints are not exceeded. The
scheme is simple, real-time, space-efficient, deterministic and transparent to the
programmer. It requires only that the total hardware used (i.e., wires and total memory) exceed
the  number  of  local  memories  by  a  logarithmic  factor.  In  return,  the  scheme  guarantees 
an  arbitrarily  high  percentage  utilization  of  the  total  memory,  independent  of  whatever 
local  demands  for  memory  arise.  We  analyze  the  behaviour  of  our  scheme  in  worst-case 
and  average-case  settings,  and  we  show  that  it  is  optimal  in  many  respects,  even  when 
compared  to  randomized  algorithms. 


1  Introduction 

In  very  large  scale  parallel  computers,  memory  is  distributed  in  small  chunks  among  a  large 
number  of  processors.  Typically,  the  total  memory  of  the  system  is  quite  large,  but  each  local 
memory is quite small. Ensuring that the capacity of the local memories is not exceeded is one
of the main problems in designing parallel algorithms and architectures. Indeed, it is often a
relatively simple matter to ensure that total memory capacity is not exceeded, but fluctuations
in  the  demand  for  local  resources  invariably  arise  that  make  allocation  of  memory  in  certain 
“hot  spots”  much  more  difficult  to  handle. 

As an example, consider the problem of routing a permutation of $N$ packets on an
$N$-processor machine. At the beginning and the end of the routing, each processor has precisely
one packet. Depending on the algorithm used to route the packets, however, large numbers of
packets might bunch up at a few nodes during intermediate stages of the routing. For example,
up to $\Theta(\sqrt{N})$ packets might bunch up at a single node if an oblivious routing strategy is used
[BH]. Although we could handle the local space crunch by allocating $\Theta(\sqrt{N})$ capacity to each
local memory, it would be terribly wasteful to do so since we would only be using a total of
$N$ out of $\Theta(N^{3/2})$ space overall at any point in time.

A  variety  of  approaches  are  used  in  practice  to  try  to  overcome  the  problems  associated 
with localization of memory. These approaches include: allocation of “extra” space at each
processor,  misallocation  (i.e.,  sending  data  where  it  doesn't  want  to  go,  but  where  there  is 
space),  destruction  of  overflow  data,  freezing  movement  of  data  until  space  becomes  available, 
and  randomizing  the  desired  location  of  data  by  (for  example)  hashing  the  memory  on  a 
global  basis.  Each  of  these  techniques  has  its  advantages  and  disadvantages.  Probably  the 
most  popular  is  randomization  which  has  been  applied  quite  successfully  to  packet  routing 
problems  [VB]  [R],  but  even  here  there  is  no  iron-clad  guarantee  of  success,  and  operations 
such  as  hashing  the  entire  memory  involve  a  great  deal  of  overhead.  Moreover,  randomization 
has  only  been  shown  to  work  for  certain  very  specific  problems. 

In  this  paper,  we  present  a  general  scheme  for  automatically  reorganizing  memory  so 
that  local  memory  bounds  are  never  exceeded  provided  that  global  memory  bounds  are  not 
exceeded. The scheme is real-time, space-efficient, deterministic, transparent to the
programmer, and perhaps best of all, quite simple. It requires only that the total hardware used
(i.e., wires and total memory) exceed the number of local memories being simulated by a
logarithmic factor. In return, the scheme guarantees an arbitrarily high percentage utilization of the
total  memory,  independent  of  whatever  local  demands  for  memory  arise. 

Our  model  assumes  that  each  local  memory  can  be  treated  as  a  linear  array  that  is  accessed 
through  a  port  associated  with  a  single  processor  in  the  network.  This  is  somewhat  restrictive 
in  that  we  do  not  allow  a  processor  to  have  random  access  to  its  local  memory,  but  it  is  general 
enough  to  include  memories  that  are  queues,  stacks,  priority  queues,  and  the  like.  We  break 
up  memory  and  data  into  fundamental  units  called  packets.  We  assume  that  one  packet  can 
traverse  a  single  wire  in  a  single  unit  of  time.  Packets  can  be  thought  of  as  bits,  bytes,  or 
entire  blocks  of  data.  The  arrival  and  departure  of  data  packets  through  a  port  is  governed 
by  a  parameter  A  that  is  usually  assumed  to  be  a  constant.  In  particular,  we  assume  that  no 
more  than  [At]  packets  can  arrive  or  depart  through  a  port  in  t  consecutive  steps.  Lastly,  we 
let  m  denote  the  number  of  local  memories,  and  p  denote  the  total  number  of  packets  in  all 
the  memories.  Note  that  the  p  packets  may  or  may  not  be  distributed  evenly  among  the  m 
processors.  Indeed,  all  p  packets  may  be  contained  in  a  single  local  memory. 
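
The port constraint can be stated as an executable check. The sketch below assumes that the bracket notation above denotes a ceiling and that the rate parameter is supplied as an exact fraction. The function name `obeys_rate_limit` and the list-based trace format are illustrative only.

```python
import math
from fractions import Fraction

def obeys_rate_limit(packets_per_step, rate):
    """Check that no window of t consecutive steps carries more than ceil(rate * t) packets.

    packets_per_step[s] is the number of packets passing through the port at step s.
    rate should be a Fraction so that rate * t is exact (an assumption of this sketch).
    """
    steps = len(packets_per_step)
    for start in range(steps):
        total = 0
        for t in range(1, steps - start + 1):
            total += packets_per_step[start + t - 1]
            if total > math.ceil(rate * t):
                return False
    return True

print(obeys_rate_limit([1, 0, 1, 0], Fraction(1, 2)))  # True
print(obeys_rate_limit([1, 1], Fraction(1, 2)))        # False
```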

The task facing us is to design a network and algorithm to simulate any action of m local
memories subject to the preceding constraints. To effect the simulation, we will construct an
N-node degree-d network where each processor has a local memory of size q. Of course, if q is
as large as p, the simulation is trivial. The object is to make N, d and q as small as possible
relative to m, A and p. In this paper, we describe a butterfly-based construction for which
A = O(1), d = O(1), q = O(1) and N = max(p, m log m), and a hypercube-based construction
for which A = O(1), d = O(log N), q = O(log N) and N = max(m, p/log m).
Both constructions are optimal in the sense that the total storage needed for the simulation
(Nq) is only slightly larger than the trivial lower bound of p when p is large, and in the sense
that the simulations are real time. The butterfly-based construction is also optimal in the sense
that any bounded-degree simulator must have $\Omega(m \log m)$ total memory, even if p is small and
even if randomized algorithms are used. Whether or not a simulator can be constructed
with fewer nodes and wires (say m) but with more local memory per node (say log m) when
p = m log m remains an open question. We show that such a simulator does exist when p = m,
but this result is less interesting since it is not space-efficient.

Our simulations are based on an efficient solution to a specialized routing problem that
we call isotone routing. In particular, an isotone routing problem is one for which the relative
order of the elements being moved is unchanged. For example, the mapping
{1 → 3, 2 → 4, 6 → 5, 7 → 8} is an isotone partial permutation. We suspect isotone routing problems will
turn out to be useful in other applications as well. As a simple example, we show how to use
our isotone routing algorithm to route any permutation problem of size N on an N log N-node
butterfly in $O(\log^2 N / \log\log N)$ steps, slightly improving the previous best known deterministic
bound of $\Theta(\log^2 N)$.

The  remainder  of  the  paper  is  divided  into  four  sections.  Our  memory  reallocation  scheme 
and  related  constructions  are  presented  in  Section  2.  Section  3  contains  the  lower  bounds 
and proofs of optimality. Average-case results are discussed in Section 4. Here it is interesting
to  note  that  although  randomized  algorithms  fare  no  better  than  deterministic  algorithms  on 
worst-case  problems,  an  average-case  problem  is  substantially  easier  to  handle  than  a  worst- 
case  problem.  We  conclude  with  some  open  questions  and  subjects  of  ongoing  research  in 
Section  5. 

Due  to  the  constraints  on  length,  proofs  and  descriptions  of  constructions  are  severely 
abbreviated.  We  will  also  restrict  our  attention  to  the  special  case  when  each  local  memory 
is  a  simple  queue. 

2  Constructions 

2.1  Isotone  Routing 

The  building  block  of  the  space-efficient  queueing  networks  discussed  in  this  section  is  a 
butterfly-based network which can route a special class of partial permutations of {1, 2, ..., m}
deterministically and on-line in $\Theta(\log m)$ steps. The network in Figure 1 (for m = 8) can
perform  such  routing  for  isotone  partial  permutations;  we  say  that  a  partial  permutation  of 
{1, 2, . . . ,  m}  is  isotone  if  the  mapping  from  sources  to  destinations  is  strictly  isotone  (since  we 
are  dealing  with  a  partial  permutation,  this  mapping  is  injective,  so  that  the  mapping  being 
isotone  is  equivalent  to  its  being  strictly  isotone).  Note  that  the  middle  row  of  columns  is  not 
really  necessary  (it  is  included  to  simplify  the  algorithm  description),  and  that  this  network, 
aside  from  the  long  vertical  edges,  is  just  a  column  permutation  of  the  Benes  back-to-back 
butterfly  network. 
The network can route an isotone partial permutation of {1, 2, ..., m} as follows. The
upper butterfly is used to count the sources, and to assign to each source its position in the
left-to-right ordering of sources (0 for the leftmost source, 1 for the next, and so on); this can be
done using a prefix calculation, and an extra step to pass the values back to the top, for a total of
log m + 1 steps. Then in 2 log m steps, each source can be routed to the position in the middle
row determined by this value and then on to its destination without collisions. That there
are no collisions comes from the fact that, since we are packing all the sources to the left,
we are in effect performing two instances (one in reverse) of deterministic on-line 0-1 routing
(given 0's and 1's distributed among the sources, route them so that all the 0's are to the left
of all the 1's), for which it is known that no collisions occur for routing using straightforward
bit-flipping. Thus the isotone partial permutation is routed in 3 log m + 1 steps.
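
The two ingredients described above, the isotone condition and the left-to-right ranking of sources, can be sketched sequentially. This is only an illustration of what the network computes, not of how the butterfly computes it. The function names and the dictionary and flag-list representations are assumptions of this sketch.

```python
from itertools import accumulate

def is_isotone(mapping):
    """True if a partial permutation (dict: source -> destination) preserves order."""
    dests = [mapping[s] for s in sorted(mapping)]
    return all(a < b for a, b in zip(dests, dests[1:]))

def source_ranks(is_source):
    """Rank each source by its position among the sources, 0 for the leftmost.

    is_source[i] is True when column i holds a source. The network obtains the
    same values with a parallel prefix calculation; here a running sum is used.
    """
    running = list(accumulate(int(flag) for flag in is_source))
    return [running[i] - 1 if flag else None for i, flag in enumerate(is_source)]

print(is_isotone({1: 3, 2: 4, 6: 5, 7: 8}))  # True
print(is_isotone({1: 4, 2: 3}))              # False
# m = 8 with sources in columns 1, 2, 6, 7 (1-indexed):
print(source_ranks([True, True, False, False, False, True, True, False]))
# [0, 1, None, None, None, 2, 3, None]
```

The rank is the middle-row position to which a source is packed before it is routed on to its destination.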

Note that if the sources are the same for a sequence of j isotone partial permutations (or
even if each set of sources is included in the previous set), then we can use the results of the
first prefix computation to route all j permutations one after the other, for a total time of
(log m + 1) + (2 log m + j − 1) = j + 3 log m. In particular, when each member of the set
of sources $s_1 < s_2 < \cdots < s_r$ has some discrete interval of destinations of length at most $j$
($\{d_i, d_i + 1, \ldots, d_i + j_i - 1\}$, $j_i \le j$), where for all $k$, $d_k + j_k - 1 < d_{k+1}$ (all the destinations
of $s_k$ are less than all the destinations of $s_{k+1}$), the above conditions are satisfied, and the
routing can be accomplished in $j + 3 \log m$ steps.
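
The step count for pipelined routing can be written out as a small helper. The sketch assumes that m is a power of two and that logarithms are base 2. The name `pipelined_steps` is illustrative.

```python
def pipelined_steps(m, j):
    """Steps to route j isotone partial permutations that share one prefix computation."""
    log_m = m.bit_length() - 1     # log2(m), assuming m is a power of two
    prefix = log_m + 1             # prefix calculation plus one step back to the top
    routing = 2 * log_m + j - 1    # first routing takes 2 log m steps, each later one adds a step
    return prefix + routing        # equals j + 3 log m

print(pipelined_steps(8, 1))   # 10, the single-permutation cost 3 log m + 1
print(pipelined_steps(8, 5))   # 14, versus 5 * 10 = 50 without pipelining
```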

Note also that by ‘folding up’ the two butterflies in the network, and identifying the
resulting top and bottom rows of processors, we could reduce the size of the network to m log m,
and allow it to perform isotone routing in 3 log m steps. However, this would eliminate any
possibility of pipelining as described above, making the current configuration a better choice
for the queueing algorithm, which relies heavily on such pipelining. Also, adding another set
of long edges across the lower butterfly will allow us to perform such routing in both directions
on the network.


2.2  Queueing  Networks  and  Algorithms 

The queue management problem can be solved for q = O(1), N = O(m log m), A < 1 and p
an arbitrarily large constant fraction of qN on a butterfly-based network.

In  its  most  basic  form,  this  network  consists  of  m  linear  arrays  of  processors  of  size  C  log  m 
(discussion  of  the  C  and  the  other  constants  involved  in  the  construction,  which  affect  the 
values  of  A  for  which  the  algorithm  will  be  correct,  is  omitted  here),  each  with  one  of  the  m 
ports  at  one  end,  and  at  the  other  end  a  connection  to  a  bidirectional  isotone  routing  network. 
At  the  bottom  of  the  isotone  routing  network  are  m  linear  arrays  of  size  logm.  For  example, 
see  Figure  2. 

The idea is that each of the m queues has some number (between (u − v) log m and
(u + v + 1) log m) of its elements stored in its upper linear array, and the excess stored in some
range of the lower linear arrays. The algorithm cycles repeatedly, each time adjusting the
number of elements in each upper linear array by either sending some multiple of log m queue
elements down to the lower linear arrays, or bringing up some multiple of log m elements from
the lower linear arrays.

Each cycle consists of four phases (presented here in order of execution; in an enhanced
version of the network and algorithm, further parallelism is used to sharpen results):

1.  Overhead,  in  which  it  is  determined  for  each  processor  whether  queue  elements  will  be 
sent  down  or  retrieved,  and  its  new  range  of  lower  linear  arrays  is  calculated. 

2.  Retrievals, where all queues which have become ‘too small’ simultaneously bring elements
from  their  ranges  of  lower  linear  arrays  back  to  their  upper  linear  arrays,  using  isotone 
routing. 

3.  Shifts,  where  those  elements  remaining  in  the  lower  linear  arrays  are  moved  to  their  new 
positions  (as  calculated  in  the  Overhead  phase),  in  preparation  for  new  elements  to  be 
sent  down. 

4.  Sends, where all queues which have become ‘too big’ simultaneously send elements from
their  upper  linear  arrays  down  to  their  ranges  of  lower  linear  arrays. 
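
The decision made for one queue in the Overhead phase can be illustrated as follows. The band limits (u − v) log m and (u + v + 1) log m come from the text. The rule for how many blocks to move (restore the upper array to within one block of u log m) is an assumption of this sketch, since the text says only that some multiple of log m elements is moved. The name `plan_queue` is hypothetical.

```python
def plan_queue(count, u, v, log_m):
    """Blocks of log_m elements to move for one queue in a cycle.

    Positive: send that many blocks down to the lower linear arrays.
    Negative: retrieve that many blocks from the lower linear arrays.
    Zero: the upper array is already inside its band.
    """
    low = (u - v) * log_m          # lower band limit, from the text
    high = (u + v + 1) * log_m     # upper band limit, from the text
    target = u * log_m             # nominal level (assumed restoring point)
    if count > high:
        return (count - target) // log_m
    if count < low:
        return -((target - count + log_m - 1) // log_m)
    return 0

print(plan_queue(20, u=4, v=1, log_m=3))  # 2: send two blocks down
print(plan_queue(7, u=4, v=1, log_m=3))   # -2: retrieve two blocks
print(plan_queue(12, u=4, v=1, log_m=3))  # 0: inside the band
```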

Analysis of the time required by these cycles of the algorithm gives us constraints on the
values of the arrival rate λ for which the algorithm will work, in terms of u, v, and C, and allows
us to find the optimal values of u and C for a given v. For the basic network and algorithm, we will
always have λ < [...], but we also describe an enhanced version with a higher degree of parallelism
which allows us to get λ arbitrarily close to 1 by choosing sufficiently large v. Although this
algorithm is explained in terms of managing stacks to simplify the explanation, adjustments can be
made to allow us to simulate general linear arrays (i.e., stacks, priority queues, FIFO queues, etc.).

We also describe a hypercube-based algorithm, obtained by identifying entire columns of
the butterfly-based network into single nodes. This network solves the problem with q =
O(log m) and N = m.

In their paper on the token distribution problem, Peleg and Upfal describe an n-node
bounded-degree network which can, with O(log n) local storage, solve routing problems for up
to n packets in O(log n) time, as long as no source or destination node has multiplicity greater
than log n. One might try replacing the isotone routing network with an m-node network of
this type and transforming each sub-array of log m nodes in the linear arrays into a single
node with log m size internal storage, yielding a bounded-degree network with q = O(log m)
and N = O(m) to solve the queueing problem. However, since the routing network has no
pipelining abilities, we would have to restrict the total number of packets in the network to
be N = m, only using 1/log m of the available space.

2.3  An  Application  of  Isotone  Routing  to  General  Permutation 
Routing 

There is a deterministic algorithm for routing a partial permutation on an n-node hypercube
in O(log² n / log log n) steps [K] [L] which, using isotone routing, yields a partial permutation
routing algorithm on the butterfly which runs in the same number of steps.

The hypercube algorithm goes as follows: Route greedily for the first log log n bits. At
this point, each log n-node subcube defined by a setting of these log log n bits can contain at
most log n packets. Using isotone routing we can spread out these packets in their respective
subcubes, at most one to each node, in O(log n) steps; repeat this process for each subproblem.
Thus the total running time is (log n / log log n) · Θ(log n) = Θ(log² n / log log n).
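
The first step of this algorithm can be tried out in a small simulation. The sketch below is one reading of "route greedily for the first log log n bits": each packet corrects the k = log log n highest-order address bits to match its destination and keeps its remaining source bits. The function name greedy_prefix_route, the choice of the high-order bits, and the use of a full random permutation are assumptions made for illustration. Under this reading a node can only receive packets from the 2^k = log n sources that share its low-order bits, so no node ever holds more than log n packets; this is the congestion that the spreading step then has to remove.

```python
import math
import random
from collections import Counter

def greedy_prefix_route(dest, k, dim):
    """Return the node each packet occupies after fixing its k high-order bits.

    dest[src] is the destination of the packet that starts at node src in a
    hypercube with 2**dim nodes.
    """
    low_mask = (1 << (dim - k)) - 1
    high_mask = ((1 << dim) - 1) ^ low_mask
    return [(dst & high_mask) | (src & low_mask) for src, dst in enumerate(dest)]

dim = 16                       # log n
n = 1 << dim                   # number of hypercube nodes
k = round(math.log2(dim))      # log log n bits are corrected greedily

rng = random.Random(0)         # fixed seed so the run is repeatable
dest = list(range(n))
rng.shuffle(dest)              # a random full permutation

positions = greedy_prefix_route(dest, k, dim)
node_load = Counter(positions)
max_load = max(node_load.values())   # never exceeds 2**k == dim == log n
```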

On the butterfly, the corresponding algorithm is as follows: route greedily for the first
log log n levels of the butterfly; then, as before, each sub-butterfly defined by a setting of these
log log n bits can contain at most log n packets. Thus we can use isotone routing to distribute
these packets evenly within the sub-butterfly in O(log n) steps; repeat this process on each
sub-butterfly. As before, the total running time is Θ(log² n / log log n). The key difference is that
we must simulate the log n-size queues of the hypercube with constant-size queues contained in
the corresponding nodes of the butterfly. The details are not difficult to work out.

3  Lower  Bounds 

Here  we  prove  lower  bounds  on  the  space  needed  to  solve  the  memory  management  problem, 
for  both  deterministic  and  randomized  algorithms.  In  looking  for  lower  bounds,  we  consider 
an  off-line  formulation  of  the  problem  where  we  are  given  a  final  queue  length  for  each  port  in 
the  network  and  the  problem  is  to  find  m  slot-disjoint  queues  of  the  appropriate  lengths  from 
the  m  ports  in  the  network;  nodes  of  the  network  have  random  access  to  their  slots.  (A  slot 
is  a  single  location  in  a  local  memory  of  the  simulating  network.)  Since  the  general  queueing 
problem  is  at  least  as  hard  as  the  off-line  problem,  any  lower  bounds  found  for  the  off-line 
problem  will  apply  to  the  on-line  problem. 

It is immediate from the problem definition that m ≤ N and p ≤ qN; also, we must have
q ≥ ⌈λ⌉ ≥ λ, since there must be room in a port to hold all the packets which could arrive (or
be requested) in one time step. In addition, for r = the length of the longest queue and T =
the number of time steps, we must have r ≤ ⌈λT⌉.
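
These necessary conditions are easy to check mechanically for a proposed set of parameters. The helper below is a minimal sketch; the function name and the parameter name lam (the arrival rate) are illustrative, and the inequalities are taken to be non-strict, which is an assumption about the printed formulas.

```python
import math

def basic_constraints_hold(m, N, p, q, lam, r, T):
    """Check the immediate necessary conditions on the parameters.

    m: ports, N: nodes, p: total packets, q: slots per node,
    lam: arrival rate, r: longest queue length, T: number of time steps.
    """
    return (
        m <= N                          # every port sits on a node
        and p <= q * N                  # all packets must fit in the network
        and q >= math.ceil(lam)         # a port holds one step's arrivals
        and r <= math.ceil(lam * T)     # a queue cannot outgrow its arrivals
    )
```

For instance, m = 8, N = 24, p = 40, q = 2, lam = 1.5, r = 6 and T = 4 pass all four checks, while lowering q to 1 violates both the capacity and the per-port condition.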

3.1  The  Deterministic  Case 

Lemma 1D: Let a queue-finding algorithm on an N-node network with m ports be given. If
mr > qN and 6rd^T < p, then there is some assignment of queue lengths to the m ports for
which the algorithm will not find disjoint queues of the appropriate lengths in the network.

Sketch of Proof: Assume that mr > qN and 6rd^T < p. If these constraints are satisfied,
then we can force two ports to select queues which intersect in some slot as follows:

Since the total number of nodes at distance no greater than T from a given node is at most
3d^T, each port’s choice of queue can depend on the values in at most 3d^T ports. For each of
the m ports, consider the queue of length r chosen when all 3d^T of the ports within distance
T are assigned the value r. The total length of these queues is mr; since mr > qN, we can
choose two ports for which these queues intersect at some slot. We assign values to ports as
follows: assign to the two ports chosen above and to each port within distance T of either of
them the value r; this is permissible since it involves at most 6d^T ports, and 6rd^T < p (there
are enough packets to go around). Assign any extra packets arbitrarily. This assignment of
values to ports will cause the algorithm to fail by choosing intersecting queues, thus proving
the Lemma.
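
The counting step in this argument is a pigeonhole argument: m queues of length r need mr slots, but only qN slots exist. A minimal sketch of how one might locate the two colliding ports is given below. The representation of a slot as a (node, index) pair and the function name find_intersecting_queues are assumptions made for the example.

```python
def find_intersecting_queues(queues):
    """Return (port_a, port_b, slot) for two queues sharing a slot, else None.

    queues maps each port to the list of slots its queue occupies.
    """
    owner = {}                                   # slot -> first port seen using it
    for port, slots in queues.items():
        for slot in slots:
            if slot in owner and owner[slot] != port:
                return owner[slot], port, slot   # two different ports collide here
            owner[slot] = port
    return None

# Example: N = 3 nodes with q = 2 slots each (6 slots), m = 3 ports, r = 3.
# Since m * r = 9 > q * N = 6, some slot must be shared.
N, q = 3, 2
all_slots = [(node, i) for node in range(N) for i in range(q)]
queues = {
    0: all_slots[0:3],
    1: all_slots[3:6],
    2: [all_slots[5], all_slots[0], all_slots[1]],
}
collision = find_intersecting_queues(queues)
```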

Lemma 2D: If 4qN log d < λm log [...], then there is no correct queue-finding algorithm.

Sketch of Proof: Assume that 4qN log d < λm log [...] and let a queue-finding algorithm
be given. Then it can be shown that r = ⌊[...] + 1⌋ and T = [...] satisfy r ≤ ⌈λT⌉ and the
conditions of Lemma 1D. It follows that the queue-finding algorithm must fail on some input.
Therefore no correct queue-finding algorithm exists.

The following theorem shows a lower bound on the size of a family of N = N(m)-node
networks with m ports, with a total queue length of p = p(m).

Theorem: Let N = N(m), and p = p(m) ≥ (qN(m))^ε for some ε ∈ (0,1] and sufficiently
large m. If qN log d = o(m log m), then no family of N-node networks with correct
queue-finding algorithms for p packets exists.

Sketch of Proof: Assume that the above conditions hold, and let a family of N-node networks
be given. Suppose that this family had correct queue-finding algorithms. Since qN log d =
o(m log m), we have that for all c > 0 and for sufficiently large m, qN log d < cm log m. Choose
c1 > 0 such that for sufficiently large m, c1 < [...], and choose c2 > 0 such
that c2 < [...] (if ε = 1, choose any c2 > 0).

For these constants and sufficiently large m, 4qN log d < λm log [...], implying that (by
Lemma 2D) no correct queue-finding algorithm can exist for sufficiently large m. This
contradicts the assumption that this is a family of networks with correct queue-finding
algorithms. We conclude that no such family exists.

3.2  The  Randomized  Case 

For  the  randomized  lower  bound  we  use  the  most  general  version  of  an  off-line  randomized 
algorithm,  where  there  are  an  arbitrary  number  of  public  random  bits  which  all  processors 
can  access  and,  as  before,  each  port  has  a  value  which  is  its  final  queue  length. 

Claim: Suppose mr > 3qN. Then for any set of r-length queues from the m ports, there
exist distinct ports p_1, p_2, ..., p_{2m/3} such that for all i ∈ {1, 2, ..., m/3}, the queues from
p_{2i−1} and p_{2i} intersect in some slot.

Sketch of Proof: Since mr > 3qN, we have that [m − 2(i − 1)]r > qN for i ∈ {1, 2, ..., m/3}.
Thus for i = 1, 2, ..., m/3 we can find an intersecting pair of queues, and remove them and their
ports from consideration. This will yield m/3 distinct pairs of ports whose queues intersect, as
desired.
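
The removal argument is constructive, and a short sketch may make it concrete. The code below repeatedly finds two ports whose queues share a slot and removes both from consideration. The function names and the representation of queues as lists of slot identifiers are assumptions for illustration; the count m/3 follows from the inequality stated in the proof sketch.

```python
def find_intersecting_pair(queues):
    """Return two ports whose queues share a slot, or None if all are disjoint."""
    owner = {}
    for port, slots in queues.items():
        for slot in slots:
            if slot in owner and owner[slot] != port:
                return owner[slot], port
            owner[slot] = port
    return None

def disjoint_intersecting_pairs(queues):
    """Greedily collect port pairs with intersecting queues, no port used twice."""
    remaining = dict(queues)
    pairs = []
    while True:
        pair = find_intersecting_pair(remaining)
        if pair is None:
            return pairs
        pairs.append(pair)
        for port in pair:
            del remaining[port]        # remove both ports from consideration

# Example: N = 2 nodes, q = 1 slot each, m = 9 ports, r = 1, so m*r = 9 > 3*q*N = 6.
m = 9
queues = {port: [port % 2] for port in range(m)}
pairs = disjoint_intersecting_pairs(queues)
```

In this small example the greedy procedure returns four pairs, which is at least the m/3 = 3 pairs that the Claim asks for.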

Claim: Suppose mr > 3qN. Then given any randomized queue-finding algorithm (and
sufficiently large m), there is some set of 2m^(1/2+δ) ports (for any δ ∈ (0, 1/2)) for which, when the
values of all ports looked at during the computation are r, we have that Pr(the algorithm fails
for bit setting B) is exponentially close to 1.

Sketch of Proof: Suppose mr > 3qN, and let a randomized queue-finding algorithm and a
setting of the random bits B be given.

We consider the collision graph G, whose nodes are the m ports, and for which there is
an edge between x and y in G if and only if the queues from ports x and y intersect when all
values seen in other ports are r, and the random bits are set according to B. By the previous
Claim, there is a set of m/3 independent edges (edges with no common endpoints) in G.

We then show that for a random set of 2m^(1/2+δ) ports, the probability that the induced
subgraph of the collision graph contains at least one of these edges is exponentially close to
1. This means that for almost all sets of 2m^(1/2+δ) ports, if all the ports looked at by any of
these ports had value r, then the algorithm would choose some pair of intersecting queues
for bit setting B. Therefore for each possible bit setting B, all but exponentially few of the
possible sets of 2m^(1/2+δ) ports cover an edge in the collision graph, so that there must be some
set of 2m^(1/2+δ) ports which covers an edge in the collision graph for all but exponentially few
bit settings. This is the set required by the Claim. In fact, the same property holds for all
but exponentially few of the sets of 2m^(1/2+δ) ports.
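
The probabilistic step can be checked empirically on a small instance. The sketch below draws random port sets of size 2m^(1/2+δ) and counts how often a set contains both endpoints of at least one edge from a fixed collection of disjoint edges. The particular edge set (consecutive port pairs), the number of disjoint edges (m // 3), the values of m and δ, and the function names are all assumptions chosen for illustration; this is a Monte Carlo estimate, not a proof.

```python
import random

def covers_some_edge(sample, edges):
    """True if the sampled port set contains both endpoints of some edge."""
    chosen = set(sample)
    return any(a in chosen and b in chosen for a, b in edges)

def estimate_cover_probability(m, delta, trials, seed=0):
    """Estimate Pr(a random set of 2*m**(1/2+delta) ports covers an edge)."""
    rng = random.Random(seed)
    # A fixed set of m // 3 disjoint edges, here simply (0,1), (2,3), ...
    edges = [(2 * i, 2 * i + 1) for i in range(m // 3)]
    size = min(m, int(2 * m ** (0.5 + delta)))
    hits = sum(
        covers_some_edge(rng.sample(range(m), size), edges)
        for _ in range(trials)
    )
    return hits / trials

estimate = estimate_cover_probability(m=3000, delta=0.25, trials=200)
```

With m = 3000 and δ = 0.25 each sampled set has 810 ports and is expected to cover dozens of edges, so the estimate should come out at or extremely close to 1.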

From  this  point,  the  argument  follows  the  same  program  as  the  proof  of  the  deterministic 
lower  bound. 

Lemma 1R: Let a randomized queue-finding algorithm on an N-node network with m ports
be given. If mr > 3qN and 6m^(1/2+δ) r d^T < p, then there is some assignment of queue lengths to
the m ports for which the algorithm will fail (that is, not find disjoint queues of the appropriate
lengths in the network) with probability exponentially close to 1.

Sketch of Proof: Assume that mr > 3qN and 6m^(1/2+δ) r d^T < p. Consider a set of 2m^(1/2+δ) ports for
which all but exponentially few of the possible settings of the random bits cause the algorithm
to select intersecting queues when the values in all the ports looked at during the computation
are r. As in the deterministic argument, a port can learn the values in at most 3d^T ports in
T steps.

Thus by assigning the value r to the 6m^(1/2+δ) d^T ports which can be looked at during the
course of the computation (and assigning any extra packets arbitrarily), we ensure that for all
but exponentially few of the possible settings of the random bits intersecting queues will be
chosen. Thus for this assignment of values to ports, the algorithm will fail with probability
exponentially close to 1.

Lemma 2R: If 12qN log d < λm log [...], then there is no randomized queue-finding
algorithm for which we cannot find an input on which it fails with probability exponentially close
to 1.

Sketch of Proof: Assume that 12qN log d < λm log [...], and let a randomized
queue-finding algorithm be given. Then r = [...] and T = [...] satisfy r ≤ ⌈λT⌉ and
the conditions of Lemma 1R. It follows that the randomized queue-finding algorithm must
almost certainly fail on some input. Therefore no randomized queue-finding algorithm for which
such an input cannot be found can exist.

The following theorem shows a lower bound on the size of a family of N = N(m)-node
networks with m ports, with a total queue length of p = p(m), which is within a constant
factor of the lower bound already found for deterministic algorithms.

Theorem: Let N = N(m), and p = p(m) ≥ (qN(m))^(ε+1/2) for some ε ∈ (0, 1/2] and sufficiently
large m. If qN log d = o(m log m), then no family of N-node networks with randomized
queue-finding algorithms for p packets where we cannot find some bad input exists.

Sketch of Proof: Assume that the above conditions hold, and let a family of N-node networks
be given. Suppose that this family had randomized queue-finding algorithms without bad
inputs. Choose δ ∈ (0, ε). Since qN log d = o(m log m), we have that for all c > 0 and
for sufficiently large m, qN log d < cm log m. Choose c1 > 0 such that for sufficiently large m,
c1 < [...] (this quantity is positive for sufficiently large m, since ε > δ).

Choose c2 > 0 such that c2 < [...] (if ε = 1/2, choose any c2 > 0).

For these constants and sufficiently large m, 12qN log d < λm log [...], implying that
(by Lemma 2R) no randomized queue-finding algorithm without a bad input can exist for
sufficiently large m. This contradicts the assumption that this is a family of networks with
randomized queue-finding algorithms free of bad inputs. We conclude that no such family exists.

It follows that, in either the deterministic or the randomized case, when the conditions of
the theorem hold, we must have qN log d = Ω(m log m) in order for a family of bounded-degree
networks with correct algorithms to exist. If local storage is constant, we must have N log d =
Ω(m log m), so that the butterfly-based construction is optimal up to a constant factor. For
q = O(log m), the hypercube variation is not necessarily optimal due to its logarithmic degree.

4  Randomized  Inputs 

In order to study the effects of random inputs, we use a Poisson model of queue arrivals, where
at each time step, the number of packets arriving (or being requested) at a port has a Poisson
distribution with mean λ; note that this is a somewhat different and more general notion of
arrival rate than was used before. Thus if K is the number of packets arriving or departing
at a port during one time step,

    Pr(K = x) = e^(−λ) λ^x / x!

for x = 0, 1, 2, ....
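
The distribution above is straightforward to evaluate numerically. The sketch below computes the Poisson probabilities and a tail probability, and uses the standard fact that the sum of k independent Poisson variables with mean λ is Poisson with mean kλ to look at a group of k ports during a single time step. The function names and the sample values of k and lam are assumptions; the single-step tail is only an illustration of how light the tail is, not the accumulation bound over time that the text goes on to discuss.

```python
import math

def poisson_pmf(x, lam):
    """Pr(K = x) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def poisson_tail(threshold, lam):
    """Pr(K > threshold) for a Poisson variable with mean lam."""
    return 1.0 - sum(poisson_pmf(x, lam) for x in range(threshold + 1))

# A group of k ports, each with arrival rate lam, receives a Poisson(k * lam)
# number of packets in one time step.
k, lam = 10, 0.25
group_tail = poisson_tail(k, k * lam)   # Pr(more than k packets arrive at the group)
```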

We will show that a group of log m ports will only with very small probability accumulate
more than a total of log m packets. In this case, we can solve each of the instances of this
smaller problem with q = Θ(log log m), for a total of (m / log m) · Θ(log m log log m) = Θ(m log log m)
space. This is much less than is required in the worst case, even if randomized algorithms are
allowed.

5  Ongoing  Research

The butterfly-based construction in section 2 shows that the lower bounds in section 3 are
tight (to within a constant factor) when local storage is constant. However, N = m log m for
this construction. Can a q = log m, N = m network with bounded degree and efficient space
usage be found, or can the lower bounds be tightened to rule out such a possibility?

Extension of our constructions to simulate random access local memories is not possible
without a degradation in simulation time (i.e., real-time simulations are no longer possible),
but it might be interesting if the results could be extended to simulate local memories that act
like trees, if this is possible. Another area of research is to find the best achievable behavior
for other specific networks: for instance, how many ports and packets can be managed in
an N-node mesh?